Date: Fri, 19 Apr 2013 19:12:43 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: minor raw-sha1-ng pull request

On 19 Apr, 2013, at 16:59 , Tavis Ormandy <taviso@...xchg8b.com> wrote:
> magnum <john.magnum@...hmail.com> wrote:
>> On 19 Apr, 2013, at 6:33 , Tavis Ormandy
>> <taviso@...xchg8b.com> wrote:
>>> Hey Magnum, I noticed I could pull a bit of work out of the crypt inner
>>> loop and do it during set_key for cheaper. Also a small amount of work
>>> could be moved into crypt (to vectorize it). It's not a huge difference,
>>> but I think it's obviously correct.
>>> 
>>> https://github.com/magnumripper/JohnTheRipper/pull/280/files
>> 
>> This is merged.
> 
> Thanks! BTW, I saw your recent commit adding OMP support to raw-sha256-ng.
> I tried to copy and paste your code, but I get this odd result:

We don't have OMP support in the fastest raw formats because it's very tricky to get any decent scaling. It can be done, though: nt2 has support, but it defaults to off. On some CPUs it's simply detrimental; on others it works fine.
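
Just to illustrate what "support" amounts to: the parallel part itself is a one-liner around the crypt loop. This is a rough sketch only; the buffer and helper names (SIMD_WIDTH, saved_key, crypt_key, sha1_block) are made up for illustration and are not the actual raw-sha1-ng code:

  #ifdef _OPENMP
  #include <omp.h>
  #endif

  static int crypt_all(int *pcount, struct db_salt *salt)
  {
          int count = *pcount;
          int i;

  #ifdef _OPENMP
          /* Each SIMD-wide batch of keys is independent, so the loop can
             be split across threads without any locking. */
  #pragma omp parallel for
  #endif
          for (i = 0; i < count; i += SIMD_WIDTH)
                  /* SIMD_WIDTH, saved_key, crypt_key and sha1_block() stand
                     in for the format's real buffers and hash routine. */
                  sha1_block(&saved_key[i], &crypt_key[i]);

          return count;
  }

The hard part isn't the pragma, it's making each crypt_all() call do enough work that the thread start/sync overhead drowns in actual hashing, which is what the OMP_SCALE tuning further down is about.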

> $ ../run/john -test -fo=raw-sha1-ng
> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics 4x]...(8xOMP) DONE
> Raw:	23232K c/s real, 3338K c/s virtual
> 
> I don't understand why real is so different from virtual, compared to
> without OMP:
> 
> $ ../run/john -test -fo=raw-sha1-ng
> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics 4x]... DONE
> Raw:	9251K c/s real, 9251K c/s virtual
> 
> What am I doing wrong? (I already batch crypts, so I figured I could just
> split the work across threads if available; maybe this was naive).

This is expected. The "real" figures are hashes per wall-clock time and the "virtual" ones are hashes per CPU time. If you could get it to scale well, the virtual figure would be near the non-OMP figure.
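
To put rough numbers on your output: 23232K real / 3338K virtual is about 7, i.e. roughly 7 CPU-seconds get burned per wall-clock second, which is about what you'd expect from 8 threads that aren't kept perfectly busy. A well-scaling OMP build would instead show a virtual figure close to the 9251K you get without OMP, with the real figure approaching 8x that.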

So for 8x OMP you only get ~2.5x the speed. As long as it's not slower than a single core, we can definitely commit it. I think you need to run much larger batches under OMP (OMP_SCALE in rawSHA256_ng_fmt.c) to hide the overhead. I got nt2 to scale fairly well on Intel with an OMP_SCALE of 1536. That is, it runs 1536*MMX_COEF*MD4_PARA crypts per call, per core. Put another way, the for loop submits 1536 normal batches to each thread.
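
In code that tuning boils down to something like this in the format's init(). Again just a sketch; the exact constants and struct members differ per format, and BATCH_SIZE here stands in for MMX_COEF*SHA_PARA or whatever the format's normal SIMD batch is:

  #ifdef _OPENMP
  #include <omp.h>
  #endif

  #define OMP_SCALE 1536  /* worked well for nt2 on Intel; tune per CPU */

  static void init(struct fmt_main *self)
  {
  #ifdef _OPENMP
          int threads = omp_get_max_threads();

          /* Hand every thread OMP_SCALE "normal" SIMD batches per
             crypt_all() call, so the per-call fork/join overhead is
             amortized over a lot of real hashing. */
          self->params.max_keys_per_crypt = OMP_SCALE * threads * BATCH_SIZE;
  #endif
          /* ...the rest of init() as before... */
  }

With the batch blown up like that, each crypt_all() call carries enough loop iterations that every thread ends up with its ~1536 normal batches, and the per-call OpenMP overhead becomes noise.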

magnum
