Date: Wed, 29 Jun 2011 15:27:20 -0500
From: "jfoug" <>
To: <>
Subject: RE: Lukas's Status Report - #7 of 15

>-----Original Message-----
>From: Łukasz Odzioba []
>This week:
>Try MSCash2 for nvidia cards.

Before you do mscash2, get with me about some significant optimizations and
some extreme simplification.  The current mscash2 does 2x the work needed
inside the inner loop.  I have redone mscash2 in preparation for converting
it to SSE intrinsics.  My main goal was to simplify the code significantly.
However, in converting the inline code into OpenSSL calls, I found that the
hashing of the ipad/opad blocks could very easily be pulled out of the inner
loop.  This results in a 2x improvement in speed, over and above any
previous optimizations found.
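
As a rough sketch of where the 2x comes from, one can count SHA-1 block
compressions.  The functions below are hypothetical and are not part of
mscash2_fmt.c.  They assume each HMAC-SHA1 call in the loop hashes a 64-byte
pad block plus a 20-byte message that fits in one more block, for both the
inner and the outer hash.

```c
/* Assumption: per HMAC-SHA1 call over a 20-byte message, the inner and the
 * outer hash each cost one compression for the 64-byte pad block and one
 * for the message block (message + SHA-1 padding fit in a single block). */

/* Naive loop: every iteration rehashes both the ipad and the opad block. */
unsigned long naive_compressions(unsigned long iterations)
{
	return iterations * 4UL;
}

/* Hoisted loop: the two pad blocks are hashed once up front and the saved
 * states are reused, leaving two compressions per iteration. */
unsigned long hoisted_compressions(unsigned long iterations)
{
	return 2UL + iterations * 2UL;
}
```

With the 10240 iterations used by this format, that is 40960 compressions
for the naive loop against 20482 with the pad blocks hoisted.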

I would get it to you now, but I have it a bit ripped open right now,
waiting on Simon to try to help figure out why multi-block SHA1 on the
intrinsic SSE is not working like I thought it did (likely it is my usage,
but I cannot find it).


Here is my code for pbkdf2().  This is the 'entire' code.  It is vastly
simpler than the original, which had calls to a HUGE hmac_sha1 function.
In this code, salt_buffer is a simple 'flat' char * value (already converted
into UTF16), and salt_len is the length of this salt (in bytes, not in UTF16
characters).  The _key[] value is the original DCC1 value (the mscash hash)
for this password/user.  This same buffer is also used to return the DCC2
value back to the caller.
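
To illustrate the bytes-versus-characters point, here is a minimal sketch of
building such a flat salt.  The helper name make_utf16_salt is hypothetical,
and the sketch assumes an ASCII-only user name and little-endian UTF16.  Any
case normalization the real format applies is not shown.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from mscash2_fmt.c.
 * Assumes ASCII-only input and little-endian UTF16 output.
 * Returns the salt length in BYTES (two per character). */
size_t make_utf16_salt(const char *user, unsigned char *out, size_t out_size)
{
	size_t n = 0;

	while (*user && n + 2 <= out_size) {
		out[n++] = (unsigned char)*user++; /* low byte: the ASCII code */
		out[n++] = 0;                      /* high byte: zero for ASCII */
	}
	return n;
}
```

For a three-character user name this returns 6, which is the kind of value
salt_len would hold.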

Even though the changes I have to mscash2_fmt.c are not prime time ready, I
hope this function will help show what I am working on.

static void pbkdf2(unsigned int _key[]) // key is also 'final' digest.
{
	SHA_CTX ctx1, ctx2, tmp_ctx1, tmp_ctx2;
	unsigned char ipad[SHA_CBLOCK+1], opad[SHA_CBLOCK+1];
	unsigned int tmp_hash[SHA_DIGEST_LENGTH/4];
	unsigned i, j;
	unsigned char *key = (unsigned char*)_key;

	// build the HMAC pads from the 16 byte DCC1 key
	for(i = 0; i < 16; i++) {
		ipad[i] = key[i]^0x36;
		opad[i] = key[i]^0x5C;
	}
	memset(&ipad[16], 0x36, sizeof(ipad)-16);
	memset(&opad[16], 0x5C, sizeof(opad)-16);

	SHA1_Init(&ctx1);
	SHA1_Init(&ctx2);

	SHA1_Update(&ctx1, ipad, SHA_CBLOCK);
	SHA1_Update(&ctx2, opad, SHA_CBLOCK);

	// save the state after the pad blocks, so the loop can reuse it
	memcpy(&tmp_ctx1, &ctx1, sizeof(SHA_CTX));
	memcpy(&tmp_ctx2, &ctx2, sizeof(SHA_CTX));

	SHA1_Update(&ctx1, salt_buffer, salt_len);
	SHA1_Update(&ctx1, "\x0\x0\x0\x1", 4);
	SHA1_Final((unsigned char*)tmp_hash, &ctx1);

	SHA1_Update(&ctx2, (unsigned char*)tmp_hash, SHA_DIGEST_LENGTH);
	// we have to sha1 final to a 'temp' buffer, since we can only
	// overwrite first 16 bytes of the _key buffer.  If we overwrote
	// 20 bytes, then we would lose the first 4 bytes of the next
	// element (and overwrite end of buffer on last element)
	SHA1_Final((unsigned char*)tmp_hash, &ctx2);

	// only copy first 16 bytes, since that is ALL this format uses
	memcpy(_key, tmp_hash, 16);

	for(i = 1; i < 10240; i++) {
		// we only need to copy the accumulator data from the CTX,
		// the original encryption was a full block of 64 bytes.
		memcpy(&ctx1, &tmp_ctx1, sizeof(SHA_CTX)-(64+sizeof(unsigned int)));
		SHA1_Update(&ctx1, (unsigned char*)tmp_hash, SHA_DIGEST_LENGTH);
		SHA1_Final((unsigned char*)tmp_hash, &ctx1);

		memcpy(&ctx2, &tmp_ctx2, sizeof(SHA_CTX)-(64+sizeof(unsigned int)));
		SHA1_Update(&ctx2, (unsigned char*)tmp_hash, SHA_DIGEST_LENGTH);
		SHA1_Final((unsigned char*)tmp_hash, &ctx2);

		// only xor first 16 bytes, since that is ALL this format uses
		for(j = 0; j < 4; j++)
			_key[j] ^= tmp_hash[j];
	}
}
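
The partial memcpy in the loop relies on how SHA_CTX is laid out.  The sketch
below mirrors that layout in a hypothetical struct (sha_ctx_layout is not an
OpenSSL type) to show that sizeof(SHA_CTX)-(64+sizeof(unsigned int)) covers
only the five chaining words and the two length words.  It assumes no padding
between members, and it assumes SHA1_Final leaves the block buffer and the
pending-byte count in a reset state, which is worth verifying against the
OpenSSL version in use.

```c
#include <stddef.h>

/* Hypothetical mirror of the OpenSSL SHA_CTX member order; an assumption
 * for illustration, not the real type. */
struct sha_ctx_layout {
	unsigned int h0, h1, h2, h3, h4; /* chaining state (the accumulators) */
	unsigned int Nl, Nh;             /* message length in bits */
	unsigned int data[16];           /* 64-byte block buffer */
	unsigned int num;                /* bytes pending in data[] */
};

/* Size copied by the partial memcpy: everything before the block buffer. */
size_t ctx_prefix_size(void)
{
	return sizeof(struct sha_ctx_layout) - (64 + sizeof(unsigned int));
}
```

On a platform with 4-byte unsigned int this comes to 28 bytes, instead of
the full structure, for each of the two copies per iteration.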
