Date: Thu, 11 Jun 2015 10:11:13 +0800
From: Lei Zhang <>
Subject: Re: Interleaving of intrinsics

> On Jun 11, 2015, at 1:19 AM, magnum <> wrote:
> On 2015-06-10 17:59, Lei Zhang wrote:
>> I further did some investigation into the asm code generated under x1
>> & x2 (SIMD_PARA_SHA256) by icc on my laptop (AVX). In SSESHA256body,
>> there're about 200 vmovdqu instructions generated under x1, and the
>> number is 260 under x2. Most of the vmovdqu instructions seem to be
>> used for loading & storing xmm registers, only a few for
>> inter-register moving. I think it's likely those additional vmovdqu
>> instructions under x2 are for register spilling.
> So we get 30% more load/store for 100% more work done. That should be a win! But this assumes we're not having actual loops in the code.

I manually checked the optimization report icc produces with x2 interleaving. By matching the line numbers of the unrolled loops listed in the report against the source, I can tell whether a specific loop in the source is unrolled.

There are 13 instances of SHA256_PARA_DO in SSESHA256body. According to icc's report, 10 of them are fully unrolled.

In addition, there are 64 instances of SHA256_STEP, each of which in turn invokes SHA256_PARA_DO. According to the report, none of them is unrolled.

So SHA256_PARA_DO contributes 13 + 64 = 77 loops in total, but only 10 of them are unrolled. That doesn't look good.
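
To make it concrete what "unrolled" means here, below is a minimal sketch of the pattern. It is not the actual JtR code: PARA, PARA_DO, vec_t and the sigma0_* functions are hypothetical stand-ins, a plain uint32_t replaces the SIMD vector type, and I'm assuming the macro is just a counted loop over the interleaved lanes. With x2, a fully unrolled loop should come out like the second function, with no loop counter left and the two lanes free to be scheduled side by side.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: interleave factor, standing in for SIMD_PARA_SHA256 at "x2". */
#define PARA 2

/* Assumption: the PARA_DO-style macro is a plain counted loop over lanes. */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

/* Scalar stand-in for a SIMD vector (the real code uses intrinsic types). */
typedef uint32_t vec_t;

/* 32-bit rotate right; only called with 0 < n < 32 below. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* SHA-256 big Sigma0, written as a loop over the interleaved lanes. */
static void sigma0_looped(vec_t out[PARA], const vec_t a[PARA])
{
    unsigned i;

    PARA_DO(i)
        out[i] = rotr32(a[i], 2) ^ rotr32(a[i], 13) ^ rotr32(a[i], 22);
}

/* The same work after full unrolling for PARA == 2: no counter, no branch. */
static void sigma0_unrolled(vec_t out[PARA], const vec_t a[PARA])
{
    out[0] = rotr32(a[0], 2) ^ rotr32(a[0], 13) ^ rotr32(a[0], 22);
    out[1] = rotr32(a[1], 2) ^ rotr32(a[1], 13) ^ rotr32(a[1], 22);
}

/* Checks that both variants agree on two sample inputs; returns 1 on success. */
int para_do_self_check(void)
{
    const vec_t a[PARA] = { 0x6a09e667u, 0xbb67ae85u };
    vec_t x[PARA], y[PARA];

    sigma0_looped(x, a);
    sigma0_unrolled(y, a);
    assert(x[0] == y[0]);
    assert(x[1] == y[1]);
    return 1;
}
```

If the compiler leaves the first form as a real loop in each of the 64 steps, the two lanes are processed one after the other inside it, which is probably not the interleaving we want.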

