Date: Fri, 5 Jun 2015 23:07:57 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

Hi,

I haven't got useful info from viewing the assembly yet. But I tried to collect some statistics using VTune.

Running PBKDF2-HMAC-SHA256 with various interleaving factors, OpenMP-disabled, on a Linux VM (Ivy Bridge):

[x1]
Function                  CPU Time
__memcpy_sse2_unaligned   0.094s
memcpy                    0.080s
cfg_get_section           0.060s
pbkdf2_sha256_sse         0.036s
_mm_xor_si128             0.020s
[Others]                  1.140s

[x2]
Function                  CPU Time
SSESHA256body             0.276s
cfg_get_section           0.042s
_mm_add_epi32             0.028s
pbkdf2_sha256_sse         0.028s
_mm_add_epi32             0.024s
[Others]                  1.452s

[x3]
Function                  CPU Time
SSESHA256body             0.444s
__memcpy_sse2_unaligned   0.076s
cfg_get_section           0.062s
pbkdf2_sha256_sse         0.038s
_mm_srli_epi32            0.036s
[Others]                  1.593s

[x4]
Function                  CPU Time
SSESHA256body             0.540s
__memcpy_sse2_unaligned   0.080s
pbkdf2_sha256_sse         0.036s
cfg_get_section           0.032s
_mm_and_si128             0.024s
[Others]                  1.918s

'__memcpy_sse2_unaligned' might imply some overhead incurred from unaligned memcpy, which is irrelevant to this topic though. And longer CPU time for SSESHA256body should be the result of larger max_keys_per_crypt. This doesn't explain the performance variation, either.

Maybe you can gather some useful info from these figures? If necessary, I can do some more detailed profiling with VTune.

Lei

> On Jun 4, 2015, at 10:22 PM, magnum <john.magnum@...hmail.com> wrote:
>
> On 2015-06-02 13:01, Solar Designer wrote:
>> Would it be reasonable for us to try my usual approach, with separate
>> variables at the outer scope (inside the hashing function, but not
>> inside the individual steps)? And if those are in fact separate
>> variables rather than array elements, this implies manual or cpp level
>> loop unrolling.
>
> I tried this out with MD5 and SHA256 in a topic branch. It doesn't seem to make any difference compared to loops and arrays.
>
> https://github.com/magnumripper/JohnTheRipper/commit/1ccc69541fef79c0f20f3143a2fcf3bedac55d30
>
> Also, other tests (before that) indicate per-line loops vs. block loops for interleaving does not make any difference either, at least not for gcc. Perhaps it does for icc (as tested on super), but all results are so fluctuating and inconclusive I just get more confused the more I test. Perhaps turbo boost and stuff are playing up.
>
> Perhaps Lei can make some conclusions from generated asm code. I think that's the only way of telling what actually happens.
>
> Maybe we under-estimate the compilers. I'm starting to think MD4 and MD5 interleaves fine poorly coded or not, while SHA1/SHA2 formats simply does not interleave well regardless of coding. If that's the case it would be a relief in a way: We could just keep the readable and straight-forward code...
>
> magnum
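
For reference, the two coding styles compared in the quoted text (array elements updated in a per-step loop, versus separate variables at the outer scope with manual unrolling) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses plain scalar C rather than SSE intrinsics so that it builds anywhere, mix() is a toy stand-in for one step and is not SHA-256, and LANES and all function names are hypothetical and not taken from the John the Ripper source.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 2  /* hypothetical interleaving factor (the "x2" case) */

/* Toy mixing step standing in for one hashing step; NOT real SHA-256. */
static uint32_t mix(uint32_t a, uint32_t b)
{
    return ((a << 5) | (a >> 27)) + (b ^ 0x9e3779b9u);
}

/* Style 1: state kept in array elements, inner loop over lanes per step. */
void rounds_array(uint32_t state[LANES], const uint32_t msg[LANES], int nrounds)
{
    for (int r = 0; r < nrounds; r++)
        for (int i = 0; i < LANES; i++)
            state[i] = mix(state[i], msg[i] + (uint32_t)r);
}

/* Style 2: separate variables at function scope, lanes unrolled by hand. */
void rounds_vars(uint32_t state[LANES], const uint32_t msg[LANES], int nrounds)
{
    uint32_t s0 = state[0], s1 = state[1];
    const uint32_t m0 = msg[0], m1 = msg[1];

    for (int r = 0; r < nrounds; r++) {
        s0 = mix(s0, m0 + (uint32_t)r);
        s1 = mix(s1, m1 + (uint32_t)r);
    }
    state[0] = s0;
    state[1] = s1;
}

/* Returns 1 if both styles compute identical results, 0 otherwise. */
int styles_agree(void)
{
    uint32_t a[LANES] = {1u, 2u}, b[LANES] = {1u, 2u};
    const uint32_t m[LANES] = {0x01234567u, 0x89abcdefu};

    rounds_array(a, m, 64);
    rounds_vars(b, m, 64);
    return a[0] == b[0] && a[1] == b[1];
}

/* Aborts via assert() if the two styles ever disagree. */
void self_check(void)
{
    assert(styles_agree());
}
```

Both functions compute the same thing; the only difference is how the independent lanes are presented to the compiler. Whether one form schedules better than the other would have to be judged from the generated asm and from timings, which is the open question in this thread.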