john-dev - Re: Interleaving of intrinsics

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0c16e5e1d42fe4f4e7ca3736f387e70a@smtp.hushmail.com>
Date: Sun, 31 May 2015 11:19:57 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On 2015-05-30 04:55, Solar Designer wrote:
> On Fri, May 29, 2015 at 10:52:41PM +0200, magnum wrote:
>> Here's a GitHub issue where we discuss interleaving and present some
>> benchmarks:
>> https://github.com/magnumripper/JohnTheRipper/issues/1217

> magnum, what exactly did you benchmark for these tables? -
>
> https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106695709
>
> I fail to make sense of them, except that they seem to show interleaving
> not helping.  It's unclear what hash type this is - raw or something
> else.  Are the 8x, 14x, and 6x the numbers of OpenMP threads?  If so,
> why 14x?  Was max_keys_per_crypt kept constant or was it increasing along
> with increased interleaving (if so, we might have been exceeding L1 data
> cache size for higher interleaving factors)?

It's pbkdf2-hmac-sha2 (using testparas.sh now in tree but subject to 
change). The 14x was on a 16xHT machine which wasn't idling, it had a 
pretty static 100% load on one core. I know 14x isn't a perfect test but 
it should be consistent.

The sha256 format has OMP_SCALE_4, perhaps I should re-test with 1.


> https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106839338
>
> These are reasonable results for pbkdf2-hmac-sha512, but there's
> something "wrong" for pbkdf2-hmac-sha256.  It is suspicious whenever
> there's a performance regression for going from x1 to x2 interleaving,
> but then a performance improvement at higher interleaving factors.  This
> suggest there's some overhead incurred with interleaving, and it is
> probably avoidable.

Perhaps the para loops doesn't always unroll in sha256 and we end up 
with actual loops, as discussed below? The overhead would be less 
significant for higher paras.


>> I think you will have some educated thoughts about this; Here's part of
>> our current SHA-1:
>> (...)
>> This file is -O3 (from a pragma) so I guess both cases will be unrolled
>> but there is obviously a big difference after just unrolling. Assuming a
>> perfect optimizer it wouldn't matter but assuming a non-perfect one, is
>> the former better? I'm guessing SHA-1 was written that way for a reason?
>
> I wasn't involved in writing either.  I think JimF was first to
> introduce the *_PARA_DO stuff to JtR.

FWIW I'm pretty sure Simon did that the first ones.


> To me, both of these depend on compiler optimization too much.  When I
> interleave stuff, I use non-array temporary variables and explicit
> unrolling via macros - see BF_std.c, MD5_std.c, and php_mt_seed.  I have
> little experience with how well or not the approach with arrays and
> loops works across different compilers/versions - I only benchmarked
> these formats in JtR just like others in here did.
> <lots snipped>

Thanks, lots of food for thought. Lei, Are you with us here?


> magnum wrote:
>> One difference is that the former is almost guaranteed to be fully
>> unrolled why the latter might not.
>
> I am not so sure.  I think these might be at similar risk of not being
> unrolled.

But the SHA1 version basically puts a loop around single instructions. 
I'd be angry with gcc if it doesn't unroll that at -O3. But anyway we 
should try to establish how it ends up and why.

magnum
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.