john-dev - Re: NetNTLMv1

Message-ID: <c4a39162ea29647d6727ef16953693a7@smtp.hushmail.com>
Date: Sun, 3 Feb 2013 03:01:54 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: NetNTLMv1

On 2 Feb, 2013, at 20:30 , magnum <john.magnum@...hmail.com> wrote:
> On 2 Feb, 2013, at 16:25 , Solar Designer <solar@...nwall.com> wrote:
>> On Fri, Feb 01, 2013 at 07:45:12AM +0400, Solar Designer wrote:
>>> With a generic+OpenMP build, it is ~3150M c/s for one process (8
>>> threads).  This puzzles me, because generic's MD4 computations are
>>> slower, whereas the comparisons are not supposed to be faster since
>>> OpenMP is only being made use of for the MD4s, not for comparisons, in
>>> that code version.  So I would have expected its performance to be
>>> around ~850M at "many salts" - same as I'm getting for one process with
>>> the XOP build (on otherwise idle system).  I don't understand where a
>>> further 4x speedup comes from.
>> 
>> I think I figured this out: generic+OpenMP uses much higher
>> max_keys_per_crypt than SIMD-enabled non-OpenMP builds do.  Can you
>> rework the latter to allow for increasing their max_keys_per_crypt?
>> My gut feeling is that a value of around 0x100 will be optimal (need to
>> make it a multiple of MMX_COEF and maybe MD4_SSE_PARA as appropriate for
>> a given build, of course).
> 
> Yes, I figured I should try that. In NT2 there is a BLOCK_LOOPS macro that is a multiplier for the SIMD number of keys. That was for OMP experiments, but the same code can be used for a single-thread loop. BTW we can actually get up to ~80M for NT2 with 2xOMP, but I haven't had any success in making it ready for production use: only hardcoded values will work. As soon as I turn any of it into run-time variables, the overhead eats the gain. This can probably be worked out. And this should apply to NTLMv1 and MSCHAPv2 too.
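
To make the "multiple of MMX_COEF and maybe MD4_SSE_PARA" constraint concrete, here is a minimal sketch of rounding a target key count up to a whole number of SIMD batches. The function and parameter names are hypothetical and not taken from the John the Ripper sources; simd_coef and simd_para merely stand in for MMX_COEF and MD4_SSE_PARA.

```c
#include <assert.h>

/*
 * Hypothetical helper, not from the actual format code:
 * round `target` up to the nearest multiple of simd_coef * simd_para,
 * so that every crypt_all() call fills complete SIMD batches.
 */
static unsigned round_up_keys(unsigned target, unsigned simd_coef,
                              unsigned simd_para)
{
    unsigned lanes = simd_coef * simd_para; /* keys per SIMD batch */

    assert(lanes > 0);
    /* Integer ceiling division, then scale back up to a key count. */
    return (target + lanes - 1) / lanes * lanes;
}
```

Assuming 4 keys per vector and 3 interleaved vectors (12 keys per batch), round_up_keys(0x100, 4, 3) gives 264, which corresponds to a loop multiplier of 22 batches.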

Lol, I hit a luxury problem:

Benchmarking: NTLMv1 C/R MD4 DES (ESS MD5) [128/128 SSE2 intrinsics 12x]... DONE
Many salts:	4294M c/s real, 4294M c/s virtual
Only one salt:	38731K c/s real, 38348K c/s virtual

I can't tune BLOCK_LOOPS, because I hit the 32-bit limit (2^32 - 1 = 4294967295) and always see 4294M. I am pretty sure I fixed this for MPI; maybe we should always use the MPI version of that code in bench.c.
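
For illustration only, here is a minimal sketch of why a result held in 32 bits pins the display at 4294M, and how a 64-bit result avoids it. This is not the actual bench.c code; the function names are made up, and the saturating behaviour is an assumption based on the constant 4294M reading above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical: c/s rate stored in 32 bits, saturating at UINT32_MAX.
 * Any real rate above ~4.29G c/s is reported as 4294967295.
 */
static uint32_t cps_clamped32(uint64_t crypts, double seconds)
{
    double cps;

    assert(seconds > 0.0);
    cps = (double)crypts / seconds;
    return cps >= 4294967295.0 ? UINT32_MAX : (uint32_t)cps;
}

/* Hypothetical: the same rate kept in 64 bits, so it does not saturate. */
static uint64_t cps_64(uint64_t crypts, double seconds)
{
    assert(seconds > 0.0);
    return (uint64_t)((double)crypts / seconds);
}
```

With a made-up count of 6000000000 crypts in one second, cps_clamped32() / 1000000 yields 4294 while cps_64() / 1000000 yields 6000, so BLOCK_LOOPS values that push the real speed past the 32-bit limit all look identical in the clamped output.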

magnum
