john-dev - Re: Interleaving of intrinsics

Message-ID: <20150622180309.GB17277@openwall.com>
Date: Mon, 22 Jun 2015 21:03:09 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On Mon, Jun 22, 2015 at 09:31:14PM +0800, Lei Zhang wrote:
> Below are the latest results from benchmarking interleaving factors. We're using all PBKDF2-HMAC formats here, and highlighted values are the highest in their rows.
> 
> magnum's laptop (OS X), gcc 5.1.0
> 
> hash\para  |       1  |       2  |       3  |       4  |       5  |
> -----------|----------|----------|----------|----------|----------|
> md4        |   18836  |   29861  |   33100  | **33424**|   30780  |
> md4-omp    |   79520  |  120128  |**121920**|  120192  |  112000  |
> md5        |   13532  |   21584  | **24420**|   22976  |   21465  |
> md5-omp    |   60736  |   86400  | **87920**|   84352  |   79360  |
> sha1       |   10736  | **10952**|    8928  |    4032  |    3740  |
> sha1-omp   | **41312**|   39744  |   34176  |   19968  |   19840  |
> sha256     |  **4664**|    2384  |    3516  |    3952  |    4120  |
> sha256-omp | **16736**|   10560  |   13782  |   15207  |   14891  |
> sha512     |  **1881**|     839  |    1290  |    1512  |    1524  |
> sha512-omp |  **6848**|    3808  |    4800  |    5639  |    5386  |

I think the -omp speeds here don't matter much, except possibly for
SHA-512.  OpenMP efficiency for these fast hashes is low.  What would
matter more is the cumulative or per-process speed with --fork.

> MIC, icc 14.0.0
> 
> hash\para  |       1  |       2  |       3  |       4  |       5  |
> -----------|----------|----------|----------|----------|----------|
> md4        |    5687  |  **6526**|    6510  |    6209  |    6196  |
> md4-omp    |  669148  |**737882**|  711529  |  662588  |  466019  |
> md5        |    4182  |    4942  |    5037  |    5005  |  **5048**|
> md5-omp    |  520871  |**536854**|  513267  |  462291  |  447378  |
> sha1       |  **2598**|    2321  |    1411  |    1415  |    1346  |
> sha1-omp   |**282352**|  253514  |  180705  |  173886  |  163018  |
> sha256     |  **1077**|     855  |     830  |     887  |     880  |
> sha256-omp |**119300**|   97882  |   96000  |   98642  |   97627  |
> sha512     |     123  |     137  |     154  |     165  |   **172**|
> sha512-omp |   15567  |   17614  |   19525  |   20389  | **21333**|

Are all of those speeds consistently in thousands of c/s?  If so, I don't
understand how we could possibly achieve e.g. 737882 thousand(?) c/s, thus
almost 738 million c/s, on MIC with our current approach to OpenMP
parallelization.  We surely should in fact achieve such a speed, and even
higher than that, with proper OpenMP parallelization, but we don't have
proper OpenMP parallelization for fast hashes yet.

Is this possibly just 737882 c/s?  Thus, almost 10x _slower_ than the
single-thread speed reported on the line above?  If so, this is
realistic... unfortunately.  But it is also totally irrelevant to the
interleaving task.  This makes me wonder why you bothered to benchmark
it and record those numbers at all?
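
The sanity check on the two readings of the md4 figures can be written
out as a small C sketch.  The function name omp_speedup is hypothetical
and only serves this illustration; the input values are the md4 and
md4-omp speeds at para=2 from the MIC table.

```c
/*
 * Ratio of a multi-threaded speed to a single-thread speed.
 * Both arguments are in plain c/s.  omp_speedup is a hypothetical
 * helper for this illustration, not part of John the Ripper.
 */
double omp_speedup(double single_cs, double omp_cs)
{
    return omp_cs / single_cs;
}

/*
 * Example use with the MIC md4 figures at para=2:
 *   omp_speedup(6526e3, 737882e3)  both read as thousands of c/s
 *   omp_speedup(6526e3, 737882.0)  the -omp figure read as plain c/s
 */
```

Under the first reading the ratio is roughly 113x; under the second it
is about 0.113x, which is a little under 9x slower than one thread.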

--fork=240 speeds would actually make sense here.  The -omp speeds do
not - these hashes are way too fast for sane OpenMP efficiency on MIC,
with our current poor approach.  Unless I am missing something?

> As stated in my previous messages, the '*_PARA_DO' macros used prevalently for interleaving aren't always unrolled as expected. OTOH, when manually unrolling those '*_PARA_DO's, the resulting code gets significantly higher register pressure, and runs slower (on x86).
> 
> We've been stalling on this issue for a while. Should we refine the method of interleaving or just stay with the current approach? What to do next?

It is difficult for me to provide advice on this without actually diving
into the task myself, and essentially replacing you.  I'd be reviewing
the generated assembly code, making changes, and reviewing the code again.
And indeed benchmarking, too.

One thing that is clear is that non-fully-unrolled *_PARA_DO are not
acceptable.  If there are not enough registers for fully unrolling
these without incurring spilling, then the interleaving factor should be
smaller.  On MIC, there should be enough registers for the interleaving
factors considered above (up to 5x).
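
To make the distinction concrete, here is a minimal sketch in plain C of
a loop-style interleaving macro next to its manually unrolled
equivalent.  Everything in it is an assumption for illustration: PARA,
PARA_DO, rotl32, round_loop and round_unrolled are hypothetical names,
the "round" is a toy mixing step rather than any real hash, and scalar
uint32_t values stand in for SIMD vectors.

```c
#include <stdint.h>

#define PARA 3                                   /* hypothetical interleaving factor */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

#if PARA != 3
#error "round_unrolled below is written out for PARA == 3 only"
#endif

static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));           /* valid for 0 < n < 32 */
}

/* Toy round over PARA independent states, relying on the compiler to
 * unroll the short fixed-count loops. */
void round_loop(uint32_t a[PARA], const uint32_t b[PARA], const uint32_t m[PARA])
{
    unsigned i;
    PARA_DO(i) a[i] += b[i] ^ m[i];
    PARA_DO(i) a[i] = rotl32(a[i], 7);
}

/* The same round with the interleaving written out by hand, so the
 * independent operations are adjacent regardless of compiler choices. */
void round_unrolled(uint32_t a[PARA], const uint32_t b[PARA], const uint32_t m[PARA])
{
    a[0] += b[0] ^ m[0];
    a[1] += b[1] ^ m[1];
    a[2] += b[2] ^ m[2];
    a[0] = rotl32(a[0], 7);
    a[1] = rotl32(a[1], 7);
    a[2] = rotl32(a[2], 7);
}
```

Both functions compute the same result.  Whether the loop form is
actually fully unrolled, and whether either form spills registers at a
given factor, can only be told from the generated assembly for the
specific compiler and flags in use.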

Another thing that is clear is that you, Lei, need to have a better
understanding and feeling for what performance figures are sane vs.
insane.  And for when our current OpenMP parallelization makes sense vs.
does not.  For the hashes benchmarked above, it does not - it's just a
formality, a correctness test.  It makes sense for slow hashes, and does
not make sense for fast hashes.  For fast hashes, --fork=240 speeds
matter.  (And maybe eventually we'll rework our OpenMP parallelization
such that it'd achieve similar speeds at fast hashes too, or
alternatively maybe some particular fast hash formats will get builtin
mask mode candidate password generators and hash comparisons, like
Sayantan implemented for a few of them in OpenCL.)

Alexander