john-dev - Re: Interleaving of intrinsics



Message-Id: <5E8B5A16-03C5-41DD-8663-1E8A2C2E81F8@gmail.com>
Date: Thu, 11 Jun 2015 21:58:44 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

> On Jun 11, 2015, at 3:30 PM, magnum <john.magnum@...hmail.com> wrote:
> 
> Now we're getting somewhere. What if you build the "unrolled" topic branch instead, using para 2 (I think I didn't add code for higher para yet). This will be manually unrolled. How many vmovdqu can you see in that? Do you see other differences compared to the bleeding code (at same para)?

The manually unrolled version generates significantly longer asm: ~8000 instructions in SSESHA256body, versus ~5000 in the auto-unrolled version.

The number of vmovdqu instructions is also much higher: 1170, versus only 260 in the auto-unrolled version (about 4.5x). Register pressure seems to be much heavier when the loops are fully unrolled.

Then performance (pbkdf2-hmac-sha256, x2):

[auto-unrolled]
Raw:	235 c/s real, 235 c/s virtual
[fully-unrolled]
Raw:	133 c/s real, 133 c/s virtual

Specs: laptop, icc, OpenMP disabled, Turbo Boost disabled

The fully unrolled build is much slower, at roughly 57% of the auto-unrolled speed (133 vs. 235 c/s). I think register pressure is playing a big role here.
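
To make the two code shapes being compared concrete, here is a minimal sketch in C with SSE2 intrinsics. It is not the actual John the Ripper code: ROUND, rounds_loop, rounds_manual and variants_agree are hypothetical names, the round function is a toy stand-in for SHA-256, and only the inner interleave loop over PARA is written out by hand. It shows only that both forms compute the same thing; how many registers and vmovdqu spills each one ends up with depends on the compiler and flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define PARA 2  /* number of independent vectors interleaved per step */

/* Toy "round" (an assumption, not SHA-256): x = (x + k) ^ (x >> 7). */
#define ROUND(x, k) \
    _mm_xor_si128(_mm_add_epi32((x), (k)), _mm_srli_epi32((x), 7))

/* Variant 1: interleave with a short inner loop and leave any
 * unrolling to the compiler ("auto-unrolled"). */
static void rounds_loop(__m128i state[PARA], const uint32_t *k, int nrounds)
{
    for (int r = 0; r < nrounds; r++) {
        __m128i kv = _mm_set1_epi32((int)k[r]);
        for (int i = 0; i < PARA; i++)
            state[i] = ROUND(state[i], kv);
    }
}

/* Variant 2: the same work with the PARA loop written out by hand.
 * This version is only valid for PARA == 2. */
static void rounds_manual(__m128i state[PARA], const uint32_t *k, int nrounds)
{
    __m128i a = state[0], b = state[1];
    for (int r = 0; r < nrounds; r++) {
        __m128i kv = _mm_set1_epi32((int)k[r]);
        a = ROUND(a, kv);
        b = ROUND(b, kv);
    }
    state[0] = a;
    state[1] = b;
}

/* Returns 1 if both variants produce bit-identical results. */
int variants_agree(void)
{
    /* Arbitrary test constants. */
    static const uint32_t k[4] = {
        0x428a2f98u, 0x71374491u, 0xb5c0fbcfu, 0xe9b5dba5u
    };
    __m128i s1[PARA], s2[PARA];

    for (int i = 0; i < PARA; i++)
        s1[i] = s2[i] = _mm_set_epi32(4 * i + 3, 4 * i + 2, 4 * i + 1, 4 * i);

    rounds_loop(s1, k, 4);
    rounds_manual(s2, k, 4);
    return memcmp(s1, s2, sizeof s1) == 0;
}
```

Calling variants_agree() from a test harness should return 1. To compare the two shapes the way it was done above, one would inspect the generated asm for each function and count the vector moves.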

Lei
