Date: Sun, 3 Feb 2013 03:01:54 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: NetNTLMv1

On 2 Feb, 2013, at 20:30 , magnum <john.magnum@...hmail.com> wrote:
> On 2 Feb, 2013, at 16:25 , Solar Designer <solar@...nwall.com> wrote:
>> On Fri, Feb 01, 2013 at 07:45:12AM +0400, Solar Designer wrote:
>>> With a generic+OpenMP build, it is ~3150M c/s for one process (8
>>> threads).  This puzzles me, because generic's MD4 computations are
>>> slower, whereas the comparisons are not supposed to be faster since
>>> OpenMP is only being made use of for the MD4s, not for comparisons, in
>>> that code version.  So I would have expected its performance to be
>>> around ~850M at "many salts" - same as I'm getting for one process with
>>> the XOP build (on otherwise idle system).  I don't understand where a
>>> further 4x speedup comes from.
>> 
>> I think I figured this out: generic+OpenMP uses much higher
>> max_keys_per_crypt than SIMD-enabled non-OpenMP builds do.  Can you
>> rework the latter to allow for increasing their max_keys_per_crypt?
>> My gut feeling is that a value of around 0x100 will be optimal (need to
>> make it a multiple of MMX_COEF and maybe MD4_SSE_PARA as appropriate for
>> a given build, of course).
> 
> Yes I figured I should try that. In NT2 there is a BLOCK_LOOPS macro that is a multiplier for SIMD number of keys. That was for OMP experiments but same code can be used for a single thread loop. BTW we can actually get up to ~80M for NT2 with 2xOMP but I haven't had any success in making it ready for production use: Only hardcoded values will work. As soon as I turn any of it into run-time variables, the overhead eats the gain. This can probably be worked out. And this should apply to NTLMv1 and MSCHAPv2 too.
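
To illustrate the sizing constraint discussed above, here is a minimal sketch of rounding a target max_keys_per_crypt up to a whole number of SIMD batches. The macro values below are assumptions chosen to match a "12x" SSE2 build (4 lanes times 3 interleaved MD4s), and round_up_keys() and block_loops_for() are hypothetical helpers, not code from the tree.

```c
/* Assumed values for illustration only; real builds define these themselves. */
#define MMX_COEF      4   /* keys per SIMD vector (assumption) */
#define MD4_SSE_PARA  3   /* interleaved vectors per MD4 call (assumption) */
#define SIMD_KEYS     (MMX_COEF * MD4_SSE_PARA)

/* Hypothetical helper: number of SIMD batches needed to cover `target` keys. */
static unsigned int block_loops_for(unsigned int target)
{
    return (target + SIMD_KEYS - 1) / SIMD_KEYS;
}

/* Hypothetical helper: `target` rounded up to a multiple of SIMD_KEYS. */
static unsigned int round_up_keys(unsigned int target)
{
    return block_loops_for(target) * SIMD_KEYS;
}
```

With these assumed values, a target of 0x100 gives 22 batches, so max_keys_per_crypt would become 264 rather than 256.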

Lol, I hit a luxury problem:

Benchmarking: NTLMv1 C/R MD4 DES (ESS MD5) [128/128 SSE2 intrinsics 12x]... DONE
Many salts:	4294M c/s real, 4294M c/s virtual
Only one salt:	38731K c/s real, 38348K c/s virtual

I can't tune BLOCK_LOOPS, because I hit the 32-bit limit and always see 4294M (2^32 is about 4294M). I am pretty sure I fixed this for MPI; maybe we should always use the MPI version of that code in bench.c.
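
A minimal sketch of the effect, assuming the reported rate is clamped to an unsigned 32-bit maximum. This is not the actual bench.c code; both function names and the tick-based parameters are hypothetical. Keeping the rate in 64 bits avoids the ceiling.

```c
#include <stdint.h>

/* Hypothetical: rate stored in 32 bits, so it tops out at UINT32_MAX
 * (4294967295), which prints as "4294M" once divided by 1000000. */
static uint32_t cps_32bit_clamped(uint64_t crypts, uint64_t ticks,
                                  uint64_t ticks_per_sec)
{
    uint64_t cps;

    if (ticks == 0)
        return 0; /* avoid division by zero on very short runs */
    cps = crypts * ticks_per_sec / ticks;
    return cps > UINT32_MAX ? UINT32_MAX : (uint32_t)cps;
}

/* Hypothetical: same computation, but the result stays 64-bit. */
static uint64_t cps_64bit(uint64_t crypts, uint64_t ticks,
                          uint64_t ticks_per_sec)
{
    if (ticks == 0)
        return 0;
    return crypts * ticks_per_sec / ticks;
}
```

For example, 50,000,000,000 crypts over 500 ticks at 100 ticks per second is 10000M c/s; the clamped version would still report 4294M, which is why different BLOCK_LOOPS settings become indistinguishable.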

magnum
