Date: Fri, 5 Jun 2015 23:07:57 +0800
From: Lei Zhang <>
Subject: Re: Interleaving of intrinsics


I haven't gotten any useful information from reading the assembly yet, but I did collect some statistics using VTune.

I ran PBKDF2-HMAC-SHA256 with various interleaving factors, with OpenMP disabled, on a Linux VM (Ivy Bridge). The top hotspots for each run were:
Function                   CPU Time
__memcpy_sse2_unaligned    0.094s
memcpy                     0.080s
cfg_get_section            0.060s
pbkdf2_sha256_sse          0.036s
_mm_xor_si128              0.020s
[Others]                   1.140s

Function                   CPU Time
SSESHA256body              0.276s
cfg_get_section            0.042s
_mm_add_epi32              0.028s
pbkdf2_sha256_sse          0.028s
_mm_add_epi32              0.024s
[Others]                   1.452s

Function                   CPU Time
SSESHA256body              0.444s
__memcpy_sse2_unaligned    0.076s
cfg_get_section            0.062s
pbkdf2_sha256_sse          0.038s
_mm_srli_epi32             0.036s
[Others]                   1.593s

Function                   CPU Time
SSESHA256body              0.540s
__memcpy_sse2_unaligned    0.080s
pbkdf2_sha256_sse          0.036s
cfg_get_section            0.032s
_mm_and_si128              0.024s
[Others]                   1.918s

The '__memcpy_sse2_unaligned' entries may indicate some overhead from unaligned memcpy, but that is unrelated to this topic. The longer CPU time for SSESHA256body should be the result of a larger max_keys_per_crypt, so it doesn't explain the performance variation either.

Maybe you can gather some useful info from these figures? If necessary, I can do some more detailed profiling with VTune.


> On Jun 4, 2015, at 10:22 PM, magnum <> wrote:
> On 2015-06-02 13:01, Solar Designer wrote:
>> Would it be reasonable for us to try my usual approach, with separate
>> variables at the outer scope (inside the hashing function, but not
>> inside the individual steps)?  And if those are in fact separate
>> variables rather than array elements, this implies manual or cpp level
>> loop unrolling.
> I tried this out with MD5 and SHA256 in a topic branch. It doesn't seem to make any difference compared to loops and arrays.
> Also, other tests (before that) indicate per-line loops vs. block loops for interleaving does not make any difference either, at least not for gcc. Perhaps it does for icc (as tested on super), but all results are so fluctuating and inconclusive I just get more confused the more I test. Perhaps turbo boost and stuff are playing up.
> Perhaps Lei can make some conclusions from generated asm code. I think that's the only way of telling what actually happens.
> Maybe we underestimate the compilers. I'm starting to think MD4 and MD5 interleave fine whether poorly coded or not, while SHA1/SHA2 formats simply do not interleave well regardless of coding. If that's the case it would be a relief in a way: We could just keep the readable and straightforward code...
> magnum
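
For readers following the thread, the two coding styles compared in the quoted message can be sketched roughly as below. This is only a toy illustration, not code from the tree: the round function is made up (it is not SHA-256), the interleaving factor is fixed at 2, and the names PARA, rotl7, toy_rounds_arrays and toy_rounds_vars are hypothetical. The first function keeps the interleaved state in arrays and uses per-line loops; the second uses separate variables at the outer scope with manual unrolling. Both compute the same result.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

#define PARA 2  /* hypothetical interleaving factor for this sketch */

/* Rotate each 32-bit lane left by 7 bits. */
static inline __m128i rotl7(__m128i x)
{
    return _mm_or_si128(_mm_slli_epi32(x, 7), _mm_srli_epi32(x, 25));
}

/* Style 1: arrays plus one short loop per step ("per-line loops").
 * Toy round: t = rotl7(a) ^ b; b = a; a = t + b. */
void toy_rounds_arrays(uint32_t state[PARA][2][4], int rounds)
{
    __m128i a[PARA], b[PARA], t[PARA];
    int i, r;

    assert(rounds >= 0);
    for (i = 0; i < PARA; i++) {
        a[i] = _mm_loadu_si128((const __m128i *)state[i][0]);
        b[i] = _mm_loadu_si128((const __m128i *)state[i][1]);
    }
    for (r = 0; r < rounds; r++) {
        for (i = 0; i < PARA; i++) t[i] = rotl7(a[i]);
        for (i = 0; i < PARA; i++) t[i] = _mm_xor_si128(t[i], b[i]);
        for (i = 0; i < PARA; i++) b[i] = a[i];
        for (i = 0; i < PARA; i++) a[i] = _mm_add_epi32(t[i], b[i]);
    }
    for (i = 0; i < PARA; i++) {
        _mm_storeu_si128((__m128i *)state[i][0], a[i]);
        _mm_storeu_si128((__m128i *)state[i][1], b[i]);
    }
}

/* Style 2: separate variables at the outer scope, manually unrolled by 2.
 * Same toy round as above, so the output must match style 1. */
void toy_rounds_vars(uint32_t state[2][2][4], int rounds)
{
    __m128i a0 = _mm_loadu_si128((const __m128i *)state[0][0]);
    __m128i b0 = _mm_loadu_si128((const __m128i *)state[0][1]);
    __m128i a1 = _mm_loadu_si128((const __m128i *)state[1][0]);
    __m128i b1 = _mm_loadu_si128((const __m128i *)state[1][1]);
    __m128i t0, t1;
    int r;

    assert(rounds >= 0);
    for (r = 0; r < rounds; r++) {
        t0 = rotl7(a0);               t1 = rotl7(a1);
        t0 = _mm_xor_si128(t0, b0);   t1 = _mm_xor_si128(t1, b1);
        b0 = a0;                      b1 = a1;
        a0 = _mm_add_epi32(t0, b0);   a1 = _mm_add_epi32(t1, b1);
    }
    _mm_storeu_si128((__m128i *)state[0][0], a0);
    _mm_storeu_si128((__m128i *)state[0][1], b0);
    _mm_storeu_si128((__m128i *)state[1][0], a1);
    _mm_storeu_si128((__m128i *)state[1][1], b1);
}
```

Since the two functions are semantically identical, an optimizing compiler is free to unroll the short fixed-count loops in the first one and may emit much the same instruction sequence for both. That would be consistent with the observation above that the coding style makes little difference, but only the generated assembly can confirm it for a given compiler and set of flags.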
