Mysterious “Cryptosystem internal error” from MIT Kerberos library

The place where I work relies heavily on Kerberos-based authentication.

So every server process that needs to accept incoming connections has a keytab installed, and every process that needs to connect to server processes has pre-stashed Kerberos tickets.
To facilitate run-time reconfiguration, those server processes normally have text-based admin interfaces, and a special Kerberos-based admin client is used to make sure authentication and authorization are done properly.

Things got pushed so far that server processes need to send special admin commands to themselves as a workaround for known issues, and we began using timers for this. Quite naturally, those server processes are clients and servers of themselves at the same time.

Recently we significantly increased this timer-based usage, to the point where a single timer needs to issue more than ten admin commands at once, so to speed things up a bit we use multiple threads for this.

Then one day one of the servers crashed due to an uncaught exception while acquiring its initial credential with a function call like:

major=::gss_acquire_cred(minor_status, desired_name, time_req, desired_mechs, cred_usage, output_cred_handle, actual_mechs, time_rec);
if(GSS_ERROR(major))
{
  ::gss_display_status(&minor, minor_status, type_, GSS_C_NULL_OID, &ctx, &buffer);
  throw CredentialException(buffer);
}

Our log shows that the minor_status code translates to the string “Cryptosystem internal error”, but nothing else.

This was confusing, as we didn’t really know what “Cryptosystem internal error” means. After all, it sounds quite internal, doesn’t it?

We use quite an old version of the MIT Kerberos library, version 1.4.4, while the latest version on the MIT website is 1.13.
Jumping directly to that one is obviously not a viable option for us.

The code is not particularly easy to read, but this time we were forced to read it.

To start, we know that “Cryptosystem internal error” corresponds to the KRB5_CRYPTO_INTERNAL status code, so we only need to look at the functions that return this status code.

We also noticed that the thread that returned the error code was not the first thread, and that other threads had successfully acquired client credentials at the same time, so we can safely ignore the functions that are run through k5_once (which calls pthread_once underneath).

GSSAPI is an API for accessing security services and various vendor libraries can choose to implement it. Most client applications simply dlopen the desired vendor library to use the implementation and can easily switch to another when needed.
(http://en.wikipedia.org/wiki/Generic_Security_Services_Application_Program_Interface)

MIT’s implementation provides a special glue file that maps each GSSAPI function to the specific krb5 implementation.

For example, the GSSAPI function “gss_acquire_cred” that we call in our application simply forwards to another one, as below:

OM_uint32 KRB5_CALLCONV
gss_acquire_cred(minor_status, desired_name, time_req, desired_mechs,
     cred_usage, output_cred_handle, actual_mechs, time_rec)
     OM_uint32 *minor_status;
     gss_name_t desired_name;
     OM_uint32 time_req;
     gss_OID_set desired_mechs;
     gss_cred_usage_t cred_usage;
     gss_cred_id_t *output_cred_handle;
     gss_OID_set *actual_mechs;
     OM_uint32 *time_rec;
{
   return(krb5_gss_acquire_cred(minor_status,
        desired_name,
        time_req,
        desired_mechs,
        cred_usage,
        output_cred_handle,
        actual_mechs,
        time_rec));
}

Following on from “krb5_gss_acquire_cred”, many functions are called either directly or indirectly, some of them several levels deep.

Luckily, after a bit of hunting through the code, it turns out there is really only one place where this status code can come from:

static krb5_error_code
init_common (krb5_context *context, krb5_boolean secure)
{
...
  /* initialize the prng (not well, but passable) */
  if ((retval = krb5_c_random_os_entropy( ctx, 0, NULL)) !=0)
    goto cleanup;
  if ((retval = krb5_crypto_us_timeofday(&seed_data.now, &seed_data.now_usec)))
    goto cleanup;
  seed_data.pid = getpid ();
  seed.length = sizeof(seed_data);
  seed.data = (char *) &seed_data;
  if ((retval = krb5_c_random_add_entropy(ctx, KRB5_C_RANDSOURCE_TIMING, &seed)))
    goto cleanup;
...

This part of code is for initializing the Pseudo Random Number Generator (PRNG) used internally in Kerberos.

This PRNG is based on Yarrow, the design co-authored by Bruce Schneier.
Yarrow is the name of the plant that was used in ancient China for divination, following the methods described in the well-known book “I Ching”.

Here is the author’s page describing how Yarrow is designed and implemented:
https://www.schneier.com/yarrow.html

So what the code quoted above does is add entropy to the PRNG pool, taking that entropy from the system time and the process ID. Of course that is not the only entropy employed: even though the time looks quite random, it is in fact highly predictable given the speed of current processors.

The add-entropy function eventually calls into yarrow_input_maybe_locking() to add the entropy to the internal pool, requesting a lock with do_lock=1:

static
int yarrow_input_maybe_locking( Yarrow_CTX* y, unsigned source_id,
        const void* sample,
        size_t size, size_t entropy_bits,
        int do_lock )
{
    EXCEP_DECL;
    int ret;
    int locked = 0;
    Source* source;
    size_t new_entropy;
    size_t estimate;

    if (!y) { THROW( YARROW_BAD_ARG ); }

    if (source_id >= y->num_sources) { THROW( YARROW_BAD_SOURCE ); }

    source = &y->source[source_id];

    if(source->pool != YARROW_FAST_POOL && source->pool != YARROW_SLOW_POOL)
    {
        THROW( YARROW_BAD_SOURCE );
    }

    if (do_lock) {
      TRY( LOCK() );
      locked = 1;
    }

    /* hash in the sample */
...
   /* put samples in alternate pools */

    source->pool = (source->pool + 1) % 2;

 CATCH:
    if ( locked ) { TRY( UNLOCK() ); }
    EXCEP_RET;
}

As you can see, the lock is not taken right away. Presumably the author meant to squeeze out some efficiency by reducing the scope of the code that requires locking.

At first glance, what is done without the lock seems safe: it is all read operations, and with this code running on 64-bit boxes we can be fairly sure each individual read is atomic.
So even if another thread is modifying a field of the object, we are still guaranteed never to read a value in an intermediate state.

However, there is a logical glitch in the above code. Check out this line:

    if(source->pool != YARROW_FAST_POOL && source->pool != YARROW_SLOW_POOL)
    {
        THROW( YARROW_BAD_SOURCE );
    }

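While the intention seems clear, it suffers from a race in the spirit of the "ABA" problem: the same field is read twice without the lock, and its value can change in between.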
http://en.wikipedia.org/wiki/ABA_problem

Since the end of the function flips source->pool, there is a chance that the value changes between our two reads of source->pool.
Suppose the first read sees YARROW_SLOW_POOL: the first comparison (against YARROW_FAST_POOL) reports a mismatch, so the second comparison should match. But if another thread flips the pool in between, the second read sees YARROW_FAST_POOL, which does not match YARROW_SLOW_POOL either. So even though the values before and after the change are both legal, the source is treated as illegal and YARROW_BAD_SOURCE is thrown.

Of course, the compiler could optimize this by reading the value once and reusing it for the second comparison.

Here is the gdb disassembly of the relevant part, which shows two separate fetches in our build:

(gdb) disassemble yarrow_input_maybe_locking
Dump of assembler code for function yarrow_input_maybe_locking:
  0x00002ba32164bd13 <+0>:   push %rbp
  0x00002ba32164bd14 <+1>:   mov %rsp, %rbp
...
  0x00002ba32164bdaa <+151>: lea (%rcx, %rax, 1), %rax
  0x00002ba32164bdae <+155>: mov %rax, -0x40(%rbp)
  0x00002ba32164bdb2 <+159>: mov -0x40(%rbp),%rax
  0x00002ba32164bdb6 <+163>: mov (%rax), %eax              ===> first fetch
  0x00002ba32164bdb8 <+165>: test %eax, %eax               ===> first comparison
  0x00002ba32164bdba <+167>: je 0x2ba32164bdd8 <yarrow_input_maybe_locking+197>
  0x00002ba32164bdbc <+169>: mov -0x40(%rbp), %rax         
  0x00002ba32164bdc0 <+173>: mov (%rax), %eax              ===> second fetch
  0x00002ba32164bdc2 <+175>: cmp $0x1, %eax                ===> second comparison
  0x00002ba32164bdc5 <+178>: je 0x2ba32164bdd8 <yarrow_input_maybe_locking+197>
  0x00002ba32164bdc7 <+180>: movl $0xfffffffb, -0x4c(%rbp) ===> return YARROW_BAD_SOURCE
...

So it's rather clear that we can hit this issue when multiple threads try to acquire credentials at the same time.

Having realized this was the bug, we went on to check the MIT Kerberos check-in history. Even though no similar issue is described in detail there, it did get patched right after the version we use, i.e. in krb5 1.5.

Here is the svn log for the change.

svn log yarrow.c
krb5-1.5-alpha1
------------------------------------------------------------------------
r17204 | raeburn | 2005-04-28 17:37:18 -0400 (Thu, 28 Apr 2005) | 6 lines

* yarrow.c: Delete old macintosh support.
(yarrow_input_maybe_locking): Do the optional locking, and verify that the
mutex is locked, before doing anything else.
(yarrow_reseed_locked): Verify that the global mutex is locked before doing
anything else.

And here goes the change:

$svn diff -r 17203:17204 yarrow.c
Index: yarrow.c
===================================================================
--- yarrow.c    (revision 17203)
+++ yarrow.c    (revision 17204)
@@ -26,11 +26,7 @@
 #include "port-sockets.h"
 #else
 #   include <unistd.h>
-#   if defined(macintosh)
-#       include <Memory.h>
-#   else
-#       include <netinet/in.h>
-#   endif
+#   include <netinet/in.h>
 #endif
 #if !defined(YARROW_NO_MATHLIB)
 #include <math.h>
@@ -262,23 +258,24 @@
     Source* source;
     size_t new_entropy;
     size_t estimate;
+
+    if (do_lock) {
+           TRY( LOCK() );
+           locked = 1;
+    }
+    k5_assert_locked(&krb5int_yarrow_lock);

     if (!y) { THROW( YARROW_BAD_ARG ); }

     if (source_id >= y->num_sources) { THROW( YARROW_BAD_SOURCE ); }

     source = &y->source[source_id];
-
+
     if(source->pool != YARROW_FAST_POOL && source->pool != YARROW_SLOW_POOL)
     {
        THROW( YARROW_BAD_SOURCE );
     }

-    if (do_lock) {
-           TRY( LOCK() );
-           locked = 1;
-    }
-
     /* hash in the sample */

...

We ended up patching the broken MIT Kerberos 1.4 library by back-porting the fix.
The latest MIT Kerberos implementation has completely decommissioned Yarrow and replaced it with another PRNG, this time named after the goddess of fortune and fate: Fortuna.


About codywu2010

a programmer