Date: Thu, 11 Jun 2015 21:58:44 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

> On Jun 11, 2015, at 3:30 PM, magnum <john.magnum@...hmail.com> wrote:
>
> Now we're getting somewhere. What if you build the "unrolled" topic branch instead, using para 2 (I think I didn't add code for higher para yet). This will be manually unrolled. How many vmovdqu can you see in that? Do you see other differences compared to the bleeding code (at same para)?

The manually unrolled version generates significantly longer asm code, with ~8000 instructions in SSESHA256body. This number is ~5000 in the auto-unrolled version.

The number of vmovdqu is also a lot bigger, that is 1170. It's only 260 in the auto-unrolled version. The register pressure seems to be much heavier when loops are fully unrolled.

Then performance (pbkdf2-hmac-sha256, x2):

[auto-unrolled]
Raw: 235 c/s real, 235 c/s virtual

[fully-unrolled]
Raw: 133 c/s real, 133 c/s virtual

Specs: laptop, icc, OpenMP disabled, turboboost disabled

We can see the fully unrolled one is much slower. I think register pressure is playing a big role here.

Lei