john-dev - Re: NetNTLMv1 and MSCHAPv2

Message-ID: <7eb9f3ffa6489ca2b6f72279bda6557a@smtp.hushmail.com>
Date: Wed, 6 Mar 2013 19:58:38 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: NetNTLMv1 and MSCHAPv2

On 6 Mar, 2013, at 3:16 , Solar Designer <solar@...nwall.com> wrote:
> I am replying without looking at the code, so it is possible that I
> recall something incorrectly.
> 
> On Wed, Mar 06, 2013 at 02:36:31AM +0100, magnum wrote:
>> The SIMD code path already separates nthash[] from crypt_key[] just for the sake of postponing stuff until needed. That is, we only copy the 2 hot bytes and then don't touch nthash[] until we reach cmp_one() [a.k.a thorough part of cmp_all()]. We currently do not take the opportunity to reduce size of crypt_key[] - the latter could be just the 2 hot bytes. I just tried this but it makes no difference on its own (actually slightly slower - wtf?).
> 
> I don't know why you got a slowdown, but in general for optimal cache
> usage we need to keep the size of hot arrays to a minimum (in bytes) -
> don't have hot and cold data intermixed.  When the hot array's elements
> are smaller, we may increase max_key_per_crypt while still fitting in L1
> data cache.  This may provide some speedup.  (On the other hand, we
> should not bring max_key_per_crypt too close to the bitmap's size, which
> is fixed at 64K bits, as that would make the bitmap ineffective.)

The slowdown was very slight, probably just the run-to-run variation we see at times, so it was really more a lack of speedup. I have now raised BLOCK_LOOPS to 8x to make use of the freed cache space and got a gain of >20% (now over 10G c/s for many salts, one core). For one salt, the speed is more or less the same as before.
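
To put a rough number on the bitmap concern quoted above, here is a minimal sketch. It assumes that each computed key sets one bit of the 64K-bit bitmap and that the 2 hot bytes are uniformly distributed. The function name is made up for illustration and is not from the john source.

```c
#define BITMAP_BITS 65536.0   /* 64K-bit bitmap, as stated in the thread */

/* Expected fraction of bitmap bits set after n_keys uniformly random
   16-bit values have been inserted: 1 - (1 - 1/65536)^n_keys.
   Hypothetical helper, for estimation only. */
double bitmap_fill_fraction(unsigned int n_keys)
{
    double empty = 1.0;   /* probability that a given bit is still clear */
    unsigned int i;

    for (i = 0; i < n_keys; i++)
        empty *= 1.0 - 1.0 / BITMAP_BITS;
    return 1.0 - empty;
}
```

Under these assumptions, 16K keys per crypt would set about 22% of the bits and 64K keys about 63%, so the share of bitmap lookups that fall through to the thorough compare grows quickly as the batch size approaches the bitmap size.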


>> On 7 Feb, 2013, at 5:02 , Solar Designer <solar@...nwall.com> wrote:
>>> Taking this a step further, we could store just a few bytes of the
>>> 14-byte portion, and recompute the rest of the NT hash in cmp_exact()
>>> when we have to.
>> 
>> This might do more good for scalar code path than for SIMD.
> 
> Too much to recompute with SIMD?  With scalar code, we'd recompute 1
> hash per 64K; with SIMD we'd recompute e.g. 12 per 64K (if we have no
> scalar NTLM code in that build).  This is still a negligible percentage,
> so performance will depend on other factors (changes in memory layout,
> etc.)  I agree that it feels wasteful to compute e.g. 12 hashes when we
> only need 1.

I did not really care about the waste; it was mostly a gut feeling. I've now tried this for the scalar code. No matter how I tweaked it, I got a performance regression of a couple of percent. It did not matter whether I stored the 7 bytes needed for cmp_one() or not. I reverted to my previous change, which is more like the SIMD case: the full NT hashes are stored in a separate (cold) array.
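
For readers following along, the hot/cold layout can be sketched roughly as below. This is an illustration only, not the committed code. It assumes the 2 hot bytes are the last two bytes of the 16-byte NT hash (the part outside the 14-byte portion), and every name here (MAX_KEYS, hot_key, cold_nthash, batch_*) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_KEYS    1024      /* hypothetical batch size */
#define BITMAP_BITS 65536     /* one bit per possible 16-bit hot value */

static uint16_t hot_key[MAX_KEYS];          /* hot: scanned on every compare */
static uint8_t  cold_nthash[MAX_KEYS][16];  /* cold: read only on a candidate hit */
static uint32_t bitmap[BITMAP_BITS / 32];

/* Clear the bitmap before computing a new batch of keys. */
void batch_reset(void)
{
    memset(bitmap, 0, sizeof(bitmap));
}

/* Record one computed NT hash: 2 bytes go to the hot array and bitmap,
   the full hash goes to the separate cold array. */
void batch_store(int i, const uint8_t nthash[16])
{
    uint16_t h = (uint16_t)(nthash[14] | (nthash[15] << 8));

    hot_key[i] = h;
    bitmap[h >> 5] |= 1u << (h & 31);
    memcpy(cold_nthash[i], nthash, 16);
}

/* Cheap test: could any key in this batch have these hot bytes? */
int batch_maybe_present(uint16_t target)
{
    return (bitmap[target >> 5] >> (target & 31)) & 1;
}

/* Slower scan of the hot array only; returns an index or -1. */
int batch_find(uint16_t target, int count, int start)
{
    int i;

    for (i = start; i < count; i++)
        if (hot_key[i] == target)
            return i;
    return -1;
}

/* Only here do we touch the cold array. */
const uint8_t *batch_cold(int i)
{
    return cold_nthash[i];
}
```

The point of the split is that the bitmap test and the scan read only 2 bytes per key, so more keys fit in L1 data cache, while the 16-byte hashes are read only for the rare index that survives both checks.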


>> OTOH maybe it could make SIMD scale better in OMP?
> 
> Hardly.

I was thinking along the lines of "store 16K crypts (or whatever number maximizes bitmap efficiency) as lean as possible". But given the poor results from the scalar experiment, I did not even bother testing it. However, the larger BLOCK_LOOPS scales better than the previous code (it's still not very good: 4xOMP is now a ~50% boost over one core), so I think I'll enable OMP for SIMD by default now.

By the way, aren't we very vulnerable to false sharing now, with this tiny crypt_key array? Maybe I should have tried spreading it out somehow.
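
One way the "spreading" could look, as an untested sketch: round each thread's slice of the array up to whole cache lines, so that no two threads write to the same line. This assumes 64-byte cache lines, an array base aligned to a line boundary, and an element size that divides the line size. The helper names are hypothetical.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes; typical for x86 */

/* Round a per-thread key count up so that each thread's slice of an array
   of elem_size-byte elements covers a whole number of cache lines.
   Assumes elem_size divides CACHE_LINE. */
size_t pad_keys_per_thread(size_t keys, size_t elem_size)
{
    size_t per_line = CACHE_LINE / elem_size;   /* elements per cache line */

    return (keys + per_line - 1) / per_line * per_line;
}

/* Nonzero if elements a and b of a line-aligned array share a cache line. */
int same_cache_line(size_t a, size_t b, size_t elem_size)
{
    return a * elem_size / CACHE_LINE == b * elem_size / CACHE_LINE;
}
```

With 2-byte elements and 12 keys per thread, for example, element 11 (end of the first thread's slice) and element 12 (start of the second thread's) land in the same line; padding the slice to 32 elements gives each thread its own line. Whether this would actually help here is not something I have measured.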

I'll commit more or less of this after some more testing.

magnum