john-dev - Re: Re: minor raw-sha1-ng pull request



Message-ID: <ffb9f1549895ceba9e473299ddfe1d79@smtp.hushmail.com>
Date: Fri, 19 Apr 2013 19:22:19 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: minor raw-sha1-ng pull request

On 19 Apr, 2013, at 19:12 , magnum <john.magnum@...hmail.com> wrote:
>> $ ../run/john -test -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics
>> 4x]...(8xOMP) DONE
>> Raw:	23232K c/s real, 3338K c/s virtual
>> 
>> I don't understand why real is so different than virtual, compared to
>> without omp:
>> 
>> $ ../run/john -test -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics 4x]... DONE
>> Raw:	9251K c/s real, 9251K c/s virtual
>> 
>> What am I doing wrong? (I already batch crypts, so I figured I could just
>> split the work across threads if available, maybe this was naive).
> 
> This is expected. The real figures are hashes per wall-clock time and the virtual ones are hashes per CPU time. If you could get it to scale well, the virtual figure would be near the non-OMP one.
> 
> So for 8x OMP you only get ~2.5x speed. As long as you don't get lower speeds than for one core, we can commit it for sure. I think you need to run much larger batches under OMP (OMP_SCALE in rawSHA256_ng_fmt.c) to hide the overhead. I got nt2 to scale fairly well on Intel with an OMP_SCALE of 1536. That is, it runs 1536*MMX_COEF*MD4_PARA crypts per call, per core. Or put another way, the for loop will submit 1536 normal batches to each thread.

OK, I get this for non-OMP build:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... DONE
Raw:	23784K c/s real, 23784K c/s virtual

And this for OMP-build but running 1 core:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... DONE
Raw:	23553K c/s real, 23553K c/s virtual

That's fine. But trying to use more cores does not work well:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... (4xOMP) DONE
Raw:	16872K c/s real, 9373K c/s virtual
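
To put numbers on these results: going by the definitions quoted above (real = hashes per wall-clock time, virtual = hashes per CPU time), the ratio real/virtual is CPU time divided by wall time, which is roughly the average number of cores kept busy. The following is only a sketch; the function names are made up for illustration and are not part of John the Ripper.

```c
/* Sketch only: helper names are hypothetical, not JtR code. */

/* real c/s    = hashes / wall-clock time
 * virtual c/s = hashes / CPU time (assumed summed over all threads)
 * so real / virtual = CPU time / wall time, i.e. the average
 * number of cores that were actually busy. */
double busy_cores(double real_cps, double virtual_cps)
{
    return real_cps / virtual_cps;
}

/* Speedup of an OMP run relative to a single-threaded run. */
double omp_speedup(double omp_real_cps, double single_real_cps)
{
    return omp_real_cps / single_real_cps;
}
```

With the figures above, busy_cores(16872, 9373) comes out at about 1.8 and omp_speedup(16872, 23553) at about 0.72: under these assumptions the 4-thread run keeps fewer than two cores busy and is slower than one core.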

I see you already have SHA1_PARALLEL_HASH of 512. Look at init() in raw-sha256-ng and try to mimic that: you probably want to use an OMP_SCALE of 3, and the number of keys would be (actual number of cores in use) * OMP_SCALE * SHA1_PARALLEL_HASH. I bet this will give much better results. But this means you need to allocate the buffers dynamically.
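
A rough sketch of that sizing and allocation, for illustration only. SHA1_PARALLEL_HASH (512) and OMP_SCALE (3) are the values mentioned above; KEY_BYTES, saved_keys and the function names are hypothetical and do not reflect the actual format interface or the raw-sha256-ng source.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define SHA1_PARALLEL_HASH 512 /* batch size mentioned in the message */
#define OMP_SCALE          3   /* suggested starting value, to be tuned */
#define KEY_BYTES          16  /* hypothetical per-key buffer size */

static unsigned char *saved_keys; /* hypothetical key buffer */

/* Number of threads OpenMP may use; 1 in a non-OMP build. */
int threads_in_use(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* keys per call = threads * OMP_SCALE * SHA1_PARALLEL_HASH */
size_t keys_per_crypt(int threads)
{
    return (size_t)threads * OMP_SCALE * SHA1_PARALLEL_HASH;
}

/* Size the buffer at run time instead of using a fixed-size array.
 * Returns the number of keys allocated, or 0 on allocation failure. */
size_t init_buffers(void)
{
    size_t n = keys_per_crypt(threads_in_use());

    saved_keys = calloc(n, KEY_BYTES);
    return saved_keys ? n : 0;
}
```

With 4 threads this gives 4 * 3 * 512 = 6144 keys per call, so each thread gets OMP_SCALE full batches to work on between synchronization points.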

magnum
