Date: Wed, 6 Mar 2013 19:58:38 +0100
From: magnum <>
Subject: Re: NetNTLMv1 and MSCHAPv2

On 6 Mar, 2013, at 3:16 , Solar Designer <> wrote:
> I am replying without looking at the code, so it is possible that I
> recall something incorrectly.
> On Wed, Mar 06, 2013 at 02:36:31AM +0100, magnum wrote:
>> The SIMD code path already separates nthash[] from crypt_key[] just for the sake of postponing stuff until needed. That is, we only copy the 2 hot bytes and then don't touch nthash[] until we reach cmp_one() [a.k.a thorough part of cmp_all()]. We currently do not take the opportunity to reduce size of crypt_key[] - the latter could be just the 2 hot bytes. I just tried this but it makes no difference on its own (actually slightly slower - wtf?).
> I don't know why you got a slowdown, but in general for optimal cache
> usage we need to keep the size of hot arrays to a minimum (in bytes) -
> don't have hot and cold data intermixed.  When the hot array's elements
> are smaller, we may increase max_key_per_crypt while still fitting in L1
> data cache.  This may provide some speedup.  (On the other hand, we
> should not bring max_key_per_crypt too close to the bitmap's size, which
> is fixed at 64K bits, as that would make the bitmap ineffective.)

The slowdown was very slight, probably just the run-to-run variation we see at times, so it was really more a lack of speedup. I have now raised BLOCK_LOOPS to 8x to make use of the freed cache space and got a gain of >20% (now over 10G c/s for many salts, one core). For one salt, the speed is more or less the same as before.
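
For readers following along, here is a minimal sketch of the hot/cold split discussed above. It is not the actual format code: all names and sizes (hot_key, cold_hash, MAX_KEYS, store_hash and so on) are hypothetical, and it assumes the 2 hot bytes are the last two bytes of the 16-byte NT hash, used as an index into the 64K-bit bitmap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS    1024    /* hypothetical max_keys_per_crypt */
#define BITMAP_BITS 65536   /* 64K bits, one per possible 16-bit value */

static uint16_t hot_key[MAX_KEYS];       /* hot: 2 bytes per candidate */
static uint8_t  cold_hash[MAX_KEYS][16]; /* cold: full NT hashes */
static uint32_t bitmap[BITMAP_BITS / 32];

static void bitmap_set(uint16_t v)
{
    bitmap[v >> 5] |= 1u << (v & 31);
}

static int bitmap_test(uint16_t v)
{
    return (bitmap[v >> 5] >> (v & 31)) & 1;
}

/* Keep the full hash in the cold array and copy only 2 bytes to the hot one. */
static void store_hash(int index, const uint8_t hash[16])
{
    assert(index >= 0 && index < MAX_KEYS);
    memcpy(cold_hash[index], hash, 16);
    hot_key[index] = (uint16_t)(hash[14] | (hash[15] << 8));
}

/* Quick reject: scans only the small hot array and the bitmap. */
static int cmp_all(int count)
{
    for (int i = 0; i < count; i++)
        if (bitmap_test(hot_key[i]))
            return 1;
    return 0;
}

/* Only the thorough comparison ever needs to read the cold array. */
static const uint8_t *full_hash(int index)
{
    return cold_hash[index];
}
```

With 2-byte elements, 1024 candidates occupy 2 KB of hot data instead of 16 KB, which is the room that a larger block count can then use.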

>> On 7 Feb, 2013, at 5:02 , Solar Designer <> wrote:
>>> Taking this a step further, we could store just a few bytes of the
>>> 14-byte portion, and recompute the rest of the NT hash in cmp_exact()
>>> when we have to.
>> This might do more good for scalar code path than for SIMD.
> Too much to recompute with SIMD?  With scalar code, we'd recompute 1
> hash per 64K; with SIMD we'd recompute e.g. 12 per 64K (if we have no
> scalar NTLM code in that build).  This is still a negligible percentage,
> so performance will depend on other factors (changes in memory layout,
> etc.)  I agree that it feels wasteful to compute e.g. 12 hashes when we
> only need 1.

I did not really care about the waste; it was mostly a gut feeling. I've now tried this for the scalar code. No matter how I tweaked it, I got a performance regression of a couple of percent. It did not matter whether I stored the 7 bytes needed for cmp_one() or not. I reverted to my previous change, which is more like the SIMD case: the full NT hashes are stored in a separate (cold) array.
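
To make the "store a few bytes, recompute the rest" idea concrete, here is a rough sketch. It is not the code I tested: toy_hash() is a hypothetical stand-in for the real MD4-based NT hash, the final comparison is a plain memcmp() rather than the DES step the real format needs, and every name and size is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 64   /* hypothetical */
#define KEY_LEN  27   /* hypothetical maximum plaintext length */

static char     saved_key[MAX_KEYS][KEY_LEN + 1];
static uint16_t hot_key[MAX_KEYS];

/* Stand-in for the NT hash (FNV-1a stretched to 16 bytes). NOT the real MD4. */
static void toy_hash(const char *key, uint8_t out[16])
{
    uint32_t h = 2166136261u;
    for (const char *p = key; *p; p++) {
        h ^= (uint8_t)*p;
        h *= 16777619u;
    }
    for (int i = 0; i < 16; i++) {
        h ^= (uint32_t)i;
        h *= 16777619u;
        out[i] = (uint8_t)(h >> 24);
    }
}

/* Hash a candidate but keep only the 2 hot bytes; the other 14 are dropped. */
static void crypt_one(int index, const char *key)
{
    uint8_t full[16];
    assert(index >= 0 && index < MAX_KEYS);
    strncpy(saved_key[index], key, KEY_LEN);
    saved_key[index][KEY_LEN] = '\0';
    toy_hash(saved_key[index], full);
    hot_key[index] = (uint16_t)(full[14] | (full[15] << 8));
}

/* Reached only after a bitmap hit: recompute the full hash from the saved key. */
static int cmp_exact(int index, const uint8_t target[16])
{
    uint8_t full[16];
    toy_hash(saved_key[index], full);
    return memcmp(full, target, 16) == 0;
}
```

The recomputation happens roughly once per 64K candidates per loaded hash, so its direct cost is negligible; whatever regression shows up has to come from elsewhere.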

>> OTOH maybe it could make SIMD scale better in OMP?
> Hardly.

I was thinking along the lines of "store 16K crypts (or whatever number maximizes bitmap efficiency) as lean as possible". But with the poor results from the scalar experiment I did not even bother testing it. However, the larger BLOCK_LOOPS scales better than the previous code (it's still not very good: 4xOMP is now a ~50% boost over one core), so I think I'll enable OMP for SIMD by default now.

By the way, are we not very vulnerable to false sharing now, with this tiny crypt_key array? Maybe I should have tried spreading it somehow.
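
One way of "spreading it", sketched here and untested in the real code: give each thread a slice of the hot array that is padded to a whole number of cache lines, so that no two threads ever write to the same line. The 64-byte line size, the thread count, the per-thread key count, and all names below are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE      64  /* assumed cache line size in bytes */
#define MAX_THREADS     8   /* hypothetical */
#define KEYS_PER_THREAD 12  /* hypothetical, e.g. one SIMD batch */

/* Number of 2-byte elements that fit in one cache line. */
#define KEYS_PER_LINE (CACHE_LINE / sizeof(uint16_t))

/* Round a key count up to a whole number of cache lines. */
#define PADDED(n) (((n) + KEYS_PER_LINE - 1) / KEYS_PER_LINE * KEYS_PER_LINE)

/* C11 alignment so that slice 0 starts on a cache line boundary. */
static _Alignas(CACHE_LINE) uint16_t crypt_key[MAX_THREADS * PADDED(KEYS_PER_THREAD)];

/* Each thread writes only inside its own padded slice. */
static uint16_t *thread_slice(int t)
{
    assert(t >= 0 && t < MAX_THREADS);
    return crypt_key + (size_t)t * PADDED(KEYS_PER_THREAD);
}
```

The price is wasted cache space (here 12 of every 32 elements are used), which works against the goal of keeping the hot array tiny, so this would have to be measured before deciding either way.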

I'll commit more or less of this after some more testing.

