Date: Mon, 17 Aug 2015 22:11:40 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Formats using non-SIMD SHA2 implementations

> On Aug 17, 2015, at 11:07 AM, Lei Zhang <zhanglei.april@...il.com> wrote:
>
>> On Aug 14, 2015, at 5:31 PM, magnum <john.magnum@...hmail.com> wrote:
>>
>> On 2015-08-14 04:35, Lei Zhang wrote:
>>>
>>> I traced the execution of 7z's encryption: the size of the hashed
>>> message can be really big, far beyond even 4 SHA2 input blocks. I
>>> think it's not possible to do the hashing with a single call to
>>> SIMDSHA256body().
>>>
>>> Is there a way to invoke SIMDSHA256body() repeatedly, just like
>>> SHA256_Update()?
>>
>> Sure, you just have to do the job yourself. The last (or single)
>> block is max 55 bytes of input; all others can be 64 bytes.
>>
>> Say you need to do 189 bytes. You take the first 64 bytes (no 0x80,
>> no length) and call SIMDSHA256body(). Then the next 64 bytes and call
>> it again. Now you have 61 bytes left. You put them in the buffer, add
>> a 0x80 and zero the rest. And call SIMDSHA256body() again. Finally,
>> in this case, you take a block of all zeros, just add the length
>> (189*8) and make a final call.
>>
>> The problem is when you have different-length inputs in one vector.
>> Say one of them requires 4 limbs, another just 3, and the rest only
>> one. This is doable (we do it in e.g. SAP F/G) but tedious - and it
>> reduces the benefit of SIMD, much like diverging threads do in
>> OpenCL. So we usually don't do SIMD with such formats.
>
> I finally got 7z to work correctly with SIMD :)
>
> On an AVX2 machine, with OpenMP disabled:
>
> [without SIMD]
> Benchmarking: 7z, 7-Zip (512K iterations) [SHA256 AES 32/64]... DONE
> Speed for cost 1 (iteration count) of 524288
> Raw: 14.5 c/s real, 14.5 c/s virtual
>
> [with SIMD]
> Benchmarking: 7z, 7-Zip (512K iterations) [SHA256 AES 32/64]... DONE
> Speed for cost 1 (iteration count) of 524288
> Raw: 41.0 c/s real, 41.0 c/s virtual

There's a minor issue.
In sevenzip_kdf(), I need a relatively big array to buffer the message
(Jim recommended buffering it on the fly with a small array, but I find
that rather hard to implement...). The array is too big to stay on the
stack, so I declared it as a static local variable. This is apparently
problematic for an OpenMP build. An easy solution would be to declare
the array as threadprivate, but I remember Solar disapproved of using
that keyword. To make it work with OpenMP, I moved the array to global
scope and made it large enough (dynamically allocated) to be shared by
multiple threads.

However, I noticed a minor degradation in performance after this
change: from ~41 c/s to ~39 c/s for a non-OpenMP build. The change is
not huge, but it can be observed consistently. I wonder whether it is
related to the different addressing mode (static array vs. dynamically
allocated array).

Lei