Date: Thu, 11 Jun 2015 10:11:13 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics


> On Jun 11, 2015, at 1:19 AM, magnum <john.magnum@...hmail.com> wrote:
> 
> On 2015-06-10 17:59, Lei Zhang wrote:
>> I further did some investigation into the asm code generated under x1
>> & x2 (SIMD_PARA_SHA256) by icc on my laptop (AVX). In SSESHA256body,
>> there're about 200 vmovdqu instructions generated under x1, and the
>> number is 260 under x2. Most of the vmovdqu instructions seem to be
>> used for loading & storing xmm registers, only a few for
>> inter-register moving. I think it's likely those additional vmovdqu
>> instructions under x2 are for register spilling.
> 
> So we get 30% more load/store for 100% more work done. That should be a win! But this assumes we're not having actual loops in the code.

I manually checked the optimization report that icc generates with x2 interleaving. The report gives the source line number of each loop it unrolls, so I can tell whether a specific loop in the source was unrolled.

There are 13 instances of SHA256_PARA_DO in SSESHA256body. According to icc's report, 10 of them are fully unrolled.

In addition, there are 64 instances of SHA256_STEP, each of which in turn invokes SHA256_PARA_DO. According to the report, none of these is unrolled.

So SHA256_PARA_DO contributes 13 + 64 = 77 loops in total, but only 10 of them are unrolled. That doesn't look good.
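
To make the loop structure concrete, here is a minimal scalar sketch of what I mean. It is not the real code: the macro body of SHA256_PARA_DO is only my assumption of its general shape, DEMO_STEP, rotr32, demo_rounds and demo_rounds_unrolled are hypothetical names, and plain uint32_t values stand in for the vector registers. Each use of the step macro expands to its own short loop over the interleaved lanes, and the hand-written variant shows what a full unroll at x2 would amount to.

```c
#include <stdint.h>

/* Interleave factor (x2). The name follows the thread; the value is an example. */
#define SIMD_PARA_SHA256 2

/* ASSUMED shape of the macro: a short loop over the interleaved lanes. */
#define SHA256_PARA_DO(x) for ((x) = 0; (x) < SIMD_PARA_SHA256; (x)++)

/* Hypothetical helper: 32-bit rotate right, valid for 0 < n < 32. */
uint32_t rotr32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* Hypothetical, heavily simplified step. Every use of this macro expands
 * to a separate loop, so 64 steps give 64 loops that the compiler must
 * each decide to unroll or not. */
#define DEMO_STEP(a, b, k) do {                      \
        unsigned i_;                                 \
        SHA256_PARA_DO(i_) {                         \
            (a)[i_] += rotr32((b)[i_], 7) ^ (k);     \
        }                                            \
    } while (0)

/* Two steps written with the loop macro. */
void demo_rounds(uint32_t a[SIMD_PARA_SHA256], uint32_t b[SIMD_PARA_SHA256])
{
    DEMO_STEP(a, b, 0x428a2f98u);
    DEMO_STEP(b, a, 0x71374491u);
}

/* The same two steps unrolled by hand for x2: no loop counter, and the
 * two lanes are independent statements that can be interleaved freely. */
void demo_rounds_unrolled(uint32_t a[2], uint32_t b[2])
{
    a[0] += rotr32(b[0], 7) ^ 0x428a2f98u;
    a[1] += rotr32(b[1], 7) ^ 0x428a2f98u;
    b[0] += rotr32(a[0], 7) ^ 0x71374491u;
    b[1] += rotr32(a[1], 7) ^ 0x71374491u;
}
```

Both functions compute the same values. The difference is only in whether the compiler has to discover the unrolled form itself, which is what the report suggests it fails to do for the step loops.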


Lei
