Date: Thu, 11 Jun 2015 19:52:31 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On 2015-06-11 15:58, Lei Zhang wrote:
>
>> On Jun 11, 2015, at 3:30 PM, magnum <john.magnum@...hmail.com> wrote:
>>
>> Now we're getting somewhere. What if you build the "unrolled" topic branch instead, using para 2 (I think I didn't add code for higher para yet). This will be manually unrolled. How many vmovdqu can you see in that? Do you see other differences compared to the bleeding code (at same para)?
>
> The manually unrolled version generates significantly longer asm code, with ~8000 instructions in SSESHA256body. This number is ~5000 in the auto-unrolled version.
>
> The number of vmovdqu is also a lot bigger, that is 1170. It's only 260 in the auto-unrolled version. The register pressure seems to be much heavier when loops are fully unrolled.
>
> Then performance (pbkdf2-hmac-sha256, x2):
>
> [auto-unrolled]
> Raw: 235 c/s real, 235 c/s virtual
> [fully-unrolled]
> Raw: 133 c/s real, 133 c/s virtual
>
> Specs: laptop, icc, OpenMP disabled, turboboost disabled
>
> We can see the fully unrolled one is much slower. I think register pressure is playing a big role here.

That version is manually interleaved x2, with no loop constructs and with non-array temp variables. We'll see what Solar says, but I presume this means we can just forget about interleaving SHA-2 on x86. And from what I've gathered I believe this also goes for SHA-1.

magnum
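
For reference, a minimal sketch of the two code shapes compared in this thread. It assumes SSE2 and uses a toy mixing step rather than the real SHA-256 round. PARA, ROUNDS, toy_round, mix_loop and mix_manual_x2 are hypothetical names chosen for illustration, not identifiers from the John the Ripper tree. Both functions advance two independent vectors: the first keeps them in an array and loops, leaving any unrolling to the compiler, while the second spells out both streams with separate non-array temporaries and no loop constructs.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

#define PARA   2  /* hypothetical interleave factor (two independent vectors) */
#define ROUNDS 4  /* hypothetical round count, kept small for readability */

/* Toy mixing step, NOT a SHA-256 round: per 32-bit lane, (x + k) ^ (x >> 7). */
static inline __m128i toy_round(__m128i x, __m128i k)
{
    return _mm_xor_si128(_mm_add_epi32(x, k), _mm_srli_epi32(x, 7));
}

/* Scalar reference for one lane of toy_round. */
static inline uint32_t toy_round_scalar(uint32_t x, uint32_t k)
{
    return (x + k) ^ (x >> 7);
}

/* Loop form: array temporaries, unrolling is left to the compiler. */
void mix_loop(uint32_t state[PARA][4], uint32_t k)
{
    __m128i v[PARA];
    __m128i kv = _mm_set1_epi32((int)k);
    int i, r;

    assert(state != NULL);
    for (i = 0; i < PARA; i++)
        v[i] = _mm_loadu_si128((const __m128i *)state[i]);
    for (r = 0; r < ROUNDS; r++)
        for (i = 0; i < PARA; i++)
            v[i] = toy_round(v[i], kv);
    for (i = 0; i < PARA; i++)
        _mm_storeu_si128((__m128i *)state[i], v[i]);
}

/* Manual form: interleaved x2, no loops, non-array temporaries. */
void mix_manual_x2(uint32_t state[2][4], uint32_t k)
{
    __m128i kv = _mm_set1_epi32((int)k);
    __m128i a, b;

    assert(state != NULL);
    a = _mm_loadu_si128((const __m128i *)state[0]);
    b = _mm_loadu_si128((const __m128i *)state[1]);

    /* Each line advances both independent streams by one round. */
    a = toy_round(a, kv); b = toy_round(b, kv);
    a = toy_round(a, kv); b = toy_round(b, kv);
    a = toy_round(a, kv); b = toy_round(b, kv);
    a = toy_round(a, kv); b = toy_round(b, kv);

    _mm_storeu_si128((__m128i *)state[0], a);
    _mm_storeu_si128((__m128i *)state[1], b);
}
```

Both functions compute the same result. With only two live vectors this toy will not spill registers; it only shows the structural difference between the two forms, not the spill behaviour reported above.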