Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 31 May 2015 11:19:57 +0200
From: magnum <>
Subject: Re: Interleaving of intrinsics

On 2015-05-30 04:55, Solar Designer wrote:
> On Fri, May 29, 2015 at 10:52:41PM +0200, magnum wrote:
>> Here's a GitHub issue where we discuss interleaving and present some
>> benchmarks:

> magnum, what exactly did you benchmark for these tables? -
> I fail to make sense of them, except that they seem to show interleaving
> not helping.  It's unclear what hash type this is - raw or something
> else.  Are the 8x, 14x, and 6x the numbers of OpenMP threads?  If so,
> why 14x?  Was max_keys_per_crypt kept constant or was it increasing along
> with increased interleaving (if so, we might have been exceeding L1 data
> cache size for higher interleaving factors)?

It's pbkdf2-hmac-sha2 (using now in tree but subject to 
change). The 14x was on a 16xHT machine which wasn't idling, it had a 
pretty static 100% load on one core. I know 14x isn't a perfect test but 
it should be consistent.

The sha256 format has OMP_SCALE_4, perhaps I should re-test with 1.

> These are reasonable results for pbkdf2-hmac-sha512, but there's
> something "wrong" for pbkdf2-hmac-sha256.  It is suspicious whenever
> there's a performance regression for going from x1 to x2 interleaving,
> but then a performance improvement at higher interleaving factors.  This
> suggest there's some overhead incurred with interleaving, and it is
> probably avoidable.

Perhaps the para loops doesn't always unroll in sha256 and we end up 
with actual loops, as discussed below? The overhead would be less 
significant for higher paras.

>> I think you will have some educated thoughts about this; Here's part of
>> our current SHA-1:
>> (...)
>> This file is -O3 (from a pragma) so I guess both cases will be unrolled
>> but there is obviously a big difference after just unrolling. Assuming a
>> perfect optimizer it wouldn't matter but assuming a non-perfect one, is
>> the former better? I'm guessing SHA-1 was written that way for a reason?
> I wasn't involved in writing either.  I think JimF was first to
> introduce the *_PARA_DO stuff to JtR.

FWIW I'm pretty sure Simon did that the first ones.

> To me, both of these depend on compiler optimization too much.  When I
> interleave stuff, I use non-array temporary variables and explicit
> unrolling via macros - see BF_std.c, MD5_std.c, and php_mt_seed.  I have
> little experience with how well or not the approach with arrays and
> loops works across different compilers/versions - I only benchmarked
> these formats in JtR just like others in here did.
> <lots snipped>

Thanks, lots of food for thought. Lei, Are you with us here?

> magnum wrote:
>> One difference is that the former is almost guaranteed to be fully
>> unrolled why the latter might not.
> I am not so sure.  I think these might be at similar risk of not being
> unrolled.

But the SHA1 version basically puts a loop around single instructions. 
I'd be angry with gcc if it doesn't unroll that at -O3. But anyway we 
should try to establish how it ends up and why.


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.