john-dev - Re: Interleaving of intrinsics

Message-Id: <6EE427BD-0BA0-47D0-9F8B-D3C01E814544@gmail.com>
Date: Tue, 14 Jul 2015 12:11:16 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics


> On Jun 23, 2015, at 2:03 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> One thing that is clear is that non-fully-unrolled *_PARA_DO are not
> acceptable.  If there are not enough registers for fully unrolling
> these without incurring spilling, then the interleaving factor should be
> smaller.  On MIC, there should be enough registers for the interleaving
> factors considered above (up to 5x).

I manually unrolled SHA256_STEP and SHA512_STEP and compared their performance against the auto-unrolled versions, using magnum's testpara.pl. The figures below were obtained on my laptop (the formats are pbkdf2-*):

[auto]
hash\para  |       1  |       2  |       3  |       4  |       5  |
-----------|----------|----------|----------|----------|----------|
sha256     |  **4020**|    3760  |    3924  |    3801  |    3940  |
sha512     |  **1624**|    1092  |    1413  |    1409  |    1435  |

[manual]
hash\para  |       1  |       2  |       3  |       4  |       5  |
-----------|----------|----------|----------|----------|----------|
sha256     |  **4144**|    1888  |    1817  |    1837  |    1821  |
sha512     |  **1646**|     748  |     708  |     720  |     722  |

With manual unrolling, performance drops sharply going from x1 to x2 interleaving (by more than half for both hashes), then stays roughly flat from x2 up to x5. BTW, I didn't change the original array temporaries; only the loops are manually unrolled here.
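
To make the comparison concrete, here is a rough sketch of what "auto" versus "manual" means above. It is not the actual JtR code: scalar uint32_t stands in for the vector types, and PARA, PARA_DO, step_loop, step_unrolled and verify_unrolled_matches_loop are hypothetical names chosen for illustration. Both variants keep the array temporaries; the only difference is whether the per-lane loop is left to the compiler or written out by hand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PARA 2  /* hypothetical interleaving factor */
#if PARA != 2
#error "step_unrolled below is written out for PARA == 2 only"
#endif

/* Hypothetical stand-in for a *_PARA_DO macro: one loop over the lanes. */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* "auto": each sub-operation is a loop; unrolling is left to the compiler. */
static void step_loop(const uint32_t e[PARA], const uint32_t f[PARA],
                      const uint32_t g[PARA], uint32_t h[PARA], uint32_t k)
{
    unsigned i;
    uint32_t tmp[PARA];  /* array temporaries, as in the loop-based code */

    PARA_DO(i) tmp[i] = rotr32(e[i], 6) ^ rotr32(e[i], 11) ^ rotr32(e[i], 25);
    PARA_DO(i) h[i] += tmp[i];
    PARA_DO(i) tmp[i] = (e[i] & f[i]) ^ (~e[i] & g[i]);
    PARA_DO(i) h[i] += tmp[i] + k;
}

/* "manual": same operations, same array temporaries, loops written out. */
static void step_unrolled(const uint32_t e[PARA], const uint32_t f[PARA],
                          const uint32_t g[PARA], uint32_t h[PARA], uint32_t k)
{
    uint32_t tmp[PARA];

    tmp[0] = rotr32(e[0], 6) ^ rotr32(e[0], 11) ^ rotr32(e[0], 25);
    tmp[1] = rotr32(e[1], 6) ^ rotr32(e[1], 11) ^ rotr32(e[1], 25);
    h[0] += tmp[0];
    h[1] += tmp[1];
    tmp[0] = (e[0] & f[0]) ^ (~e[0] & g[0]);
    tmp[1] = (e[1] & f[1]) ^ (~e[1] & g[1]);
    h[0] += tmp[0] + k;
    h[1] += tmp[1] + k;
}

/* Checks that both variants compute identical results; returns 1 on success. */
static int verify_unrolled_matches_loop(void)
{
    const uint32_t e[PARA] = { 0x510e527fu, 0x9b05688cu };
    const uint32_t f[PARA] = { 0x1f83d9abu, 0x5be0cd19u };
    const uint32_t g[PARA] = { 0x6a09e667u, 0xbb67ae85u };
    uint32_t h1[PARA] = { 1u, 2u };
    uint32_t h2[PARA] = { 1u, 2u };

    step_loop(e, f, g, h1, 0x428a2f98u);
    step_unrolled(e, f, g, h2, 0x428a2f98u);
    assert(memcmp(h1, h2, sizeof h1) == 0);
    return 1;
}
```

Calling verify_unrolled_matches_loop() from a main() confirms that the two variants are functionally identical, so any speed difference between them comes down to how the compiler schedules the code and allocates registers, not to the amount of work done.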


Lei