Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 6 Mar 2013 06:16:03 +0400
From: Solar Designer <>
Subject: Re: NetNTLMv1 and MSCHAPv2

magnum -

I am replying without looking at the code, so it is possible that I
recall something incorrectly.

On Wed, Mar 06, 2013 at 02:36:31AM +0100, magnum wrote:
> The SIMD code path already separates nthash[] from crypt_key[] just for the sake of postponing stuff until needed. That is, we only copy the 2 hot bytes and then don't touch nthash[] until we reach cmp_one() [a.k.a thorough part of cmp_all()]. We currently do not take the opportunity to reduce size of crypt_key[] - the latter could be just the 2 hot bytes. I just tried this but it makes no difference on its own (actually slightly slower - wtf?).

I don't know why you got a slowdown, but in general for optimal cache
usage we need to keep the size of hot arrays to a minimum (in bytes) -
don't have hot and cold data intermixed.  When the hot array's elements
are smaller, we may increase max_key_per_crypt while still fitting in L1
data cache.  This may provide some speedup.  (On the other hand, we
should not bring max_key_per_crypt too close to the bitmap's size, which
is fixed at 64K bits, as that would make the bitmap ineffective.)

> On 7 Feb, 2013, at 5:02 , Solar Designer <> wrote:
> > Taking this a step further, we could store just a few bytes of the
> > 14-byte portion, and recompute the rest of the NT hash in cmp_exact()
> > when we have to.
> This might do more good for scalar code path than for SIMD.

Too much to recompute with SIMD?  With scalar code, we'd recompute 1
hash per 64K; with SIMD we'd recompute e.g. 12 per 64K (if we have no
scalar NTLM code in that build).  This is still a negligible percentage,
so performance will depend on other factors (changes in memory layout,
etc.)  I agree that it feels wasteful to compute e.g. 12 hashes when we
only need 1.

> OTOH maybe it could make SIMD scale better in OMP?


> > ... and if we store e.g. just last 4 bytes of each NT hash, then we can
> > get away with using just one array, like we do now.  2 bytes of each
> > element will be directly compared to the known values for the 3rd DES
> > block key, and 2 bytes before those last 2 will be compared to results
> > of DES encryption of the 2nd block, which are computed as needed.
> > I think I like this option best.  Separating the hot/cold arrays has
> > some cost of its own, and we can avoid that by simply making each
> > element this small (only 4 bytes).  The index to offset calculation
> > becomes trivial (and supported in x86's addressing modes natively).
> Maybe I'm thick now but I don't follow. cmp_all() checks the known last two bytes of the NT hash. If they match (once in 64K), we calculate the first DES block and compare that. We need the first 7 bytes of the NT hash for this so 2+7 bytes would make it until cmp_exact(). How would we use just 2+2 bytes?

I think I was wrong, and you're right, but I'm not sure what exactly I
was thinking when I wrote the above a month ago. ;-)


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.