Date: Wed, 6 Mar 2013 19:58:38 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: NetNTLMv1 and MSCHAPv2

On 6 Mar, 2013, at 3:16 , Solar Designer <solar@...nwall.com> wrote:
> I am replying without looking at the code, so it is possible that I
> recall something incorrectly.
>
> On Wed, Mar 06, 2013 at 02:36:31AM +0100, magnum wrote:
>> The SIMD code path already separates nthash from crypt_key just for the sake of postponing stuff until needed. That is, we only copy the 2 hot bytes and then don't touch nthash until we reach cmp_one() [a.k.a thorough part of cmp_all()]. We currently do not take the opportunity to reduce size of crypt_key - the latter could be just the 2 hot bytes. I just tried this but it makes no difference on its own (actually slightly slower - wtf?).
>
> I don't know why you got a slowdown, but in general for optimal cache
> usage we need to keep the size of hot arrays to a minimum (in bytes) -
> don't have hot and cold data intermixed. When the hot array's elements
> are smaller, we may increase max_key_per_crypt while still fitting in L1
> data cache. This may provide some speedup. (On the other hand, we
> should not bring max_key_per_crypt too close to the bitmap's size, which
> is fixed at 64K bits, as that would make the bitmap ineffective.)

The slowdown was very slight - probably just the coincidental variations we see at times - so it was probably more a lack of speedup. I now raised BLOCK_LOOPS to 8x for making use of the freed cache space and got a gain of >20% (now over 10G c/s for many salts, one core). For one salt, the speed is more or less the same as before.

>> On 7 Feb, 2013, at 5:02 , Solar Designer <solar@...nwall.com> wrote:
>>> Taking this a step further, we could store just a few bytes of the
>>> 14-byte portion, and recompute the rest of the NT hash in cmp_exact()
>>> when we have to.
>>
>> This might do more good for scalar code path than for SIMD.
>
> Too much to recompute with SIMD? With scalar code, we'd recompute 1
> hash per 64K; with SIMD we'd recompute e.g. 12 per 64K (if we have no
> scalar NTLM code in that build). This is still a negligible percentage,
> so performance will depend on other factors (changes in memory layout,
> etc.) I agree that it feels wasteful to compute e.g. 12 hashes when we
> only need 1.

I did not really care about the waste, it was mostly a gut feeling. I've now tried this for the scalar code. No matter how I tweaked it, I got a performance regression of a couple of percent. It did not matter whether I stored the 7 bytes needed for cmp_one() or not. I reverted to my previous change, which is more like the SIMD case: the full NT hashes are stored in a separate (cold) array.

>> OTOH maybe it could make SIMD scale better in OMP?
>
> Hardly.

I was thinking along the lines of "store 16K crypts (or whatever number maximizes bitmap efficiency) as lean as possible". But with the poor results from the scalar experiment I did not even bother testing it. However, the larger BLOCK_LOOPS scales better than the previous code (it's still not very good, 4xOMP is ~50% boost now compared to one core) so I think I'll enable OMP for SIMD by default now.

By the way, are we not very vulnerable to false sharing now, with this tiny crypt_key array? Maybe I should have tried spreading it somehow.

I'll commit more or less of this after some more testing.

magnum
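
Editorial sketch, not part of the original message. The trade-off discussed in the thread (leaner hot entries let more keys per crypt fit in L1, but more keys fill the 64K-bit bitmap) and the recompute cost (1 versus e.g. 12 hashes per 64K candidates) can be put into rough numbers. The Python below is only an illustration under stated assumptions: the bitmap is taken to hold one bit per possible 2-byte value, candidate values are taken to be uniformly distributed, and the 32 KiB L1 data cache, the 16-byte "fat" entry size and all function names are hypothetical rather than taken from the John the Ripper source.

```python
BITMAP_BITS = 1 << 16  # "fixed at 64K bits" per the thread


def hot_bytes(keys_per_crypt, bytes_per_key):
    """Size in bytes of the hot array for one batch of candidates."""
    return keys_per_crypt * bytes_per_key


def bitmap_fill(keys_per_crypt, bits=BITMAP_BITS):
    """Expected fraction of bits set after marking keys_per_crypt values.

    Assumes each candidate marks one uniformly random bit. The result is
    also the chance that a non-matching target passes the bitmap check.
    """
    return 1.0 - (1.0 - 1.0 / bits) ** keys_per_crypt


def recompute_fraction(hashes_per_hit, bits=BITMAP_BITS):
    """Share of hashes recomputed if each 2-byte hit costs hashes_per_hit."""
    return hashes_per_hit / bits


if __name__ == "__main__":
    l1d = 32 * 1024  # assumption: 32 KiB L1 data cache
    for n in (1024, 4096, 16384, 65536):
        lean = hot_bytes(n, 2)   # only the 2 hot bytes per candidate
        fat = hot_bytes(n, 16)   # assumption: a full 16-byte NT hash
        print(f"{n:6d} keys: lean {lean:7d} B (fits L1: {lean <= l1d}), "
              f"fat {fat:7d} B (fits L1: {fat <= l1d}), "
              f"bitmap fill {bitmap_fill(n):.3f}")
    print(f"scalar recompute share: {recompute_fraction(1):.6f}")
    print(f"SIMD recompute share (12 lanes): {recompute_fraction(12):.6f}")
```

Under these assumptions, 16K candidates at 2 bytes each occupy exactly 32 KiB and set roughly 22% of the bitmap, while 64K candidates would set roughly 63% of it. Recomputing 12 hashes per 64K candidates is still under 0.02% of the work.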