Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 16 Mar 2015 04:35:30 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Xeon Phi

On Thu, Mar 05, 2015 at 07:47:40PM +0300, Solar Designer wrote:
> [user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=descrypt
> Will run 240 OpenMP threads
> Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
> Many salts:     80170K c/s real, 333233 c/s virtual
> Only one salt:  7660K c/s real, 61684 c/s virtual
> 
> (This is on our 5110P.)

I've just improved this to:

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=descrypt
Will run 240 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:     84811K c/s real, 352128 c/s virtual
Only one salt:  7710K c/s real, 62106 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te -form=descrypt
Will run 240 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:     85040K c/s real, 354418 c/s virtual
Only one salt:  7391K c/s real, 45700 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te -form=descrypt
Will run 240 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:     85209K c/s real, 354606 c/s virtual
Only one salt:  7405K c/s real, 46525 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te -form=descrypt
Will run 240 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:     85040K c/s real, 354976 c/s virtual
Only one salt:  7391K c/s real, 46985 c/s virtual

by adding the -no-opt-prefetch option to CFLAGS in the linux-mic make
target in the core tree.  magnum, please get this into jumbo as well.

The prefetch instructions appeared to hurt performance in this case.
I've also tried the settings from -opt-prefetch=1 to -opt-prefetch=4,
and all of them were slower than -no-opt-prefetch.

Once I chose -no-opt-prefetch, I tried re-tuning DES_bs_cpt and
DES_BS_EXPAND, but got inconclusive results, so I left them as-is.

-no-opt-prefetch also improves bcrypt speed from:

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te -form=bcrypt
Will run 240 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    6285 c/s real, 26.2 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=bcrypt
Will run 240 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    6280 c/s real, 26.2 c/s virtual

to:

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te -form=bcrypt
Will run 240 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    6330 c/s real, 26.3 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=bcrypt
Will run 240 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    6339 c/s real, 26.3 c/s virtual

> The icc-generated assembly code for DES_bs_b.c looks sane to me (about
> as good as what we're used to seeing for AVX, or maybe even better since
> there are 32 ZMM registers).

I noticed lots of vmovaps and vmovapd in the generated code.  I tried
replacing them with vmovdqa32 or vmovdqa64 using "sed -i" and building
from the modified .s file.  This didn't affect performance.  I also
tried switching from the *_epi32 to the *_epi64 macros (which actually
correspond to different instructions for bitwise ops where these are
supposed to produce the same results).  This also made no difference for
performance.  Not even when I consistently made all of the loads,
stores, and the bitwise instructions operate on vectors with 64-bit
elements.  So I reverted those changes for now.  (In fact, I never had a
C source code level change to produce the vmovdqa32 or vmovdqa64
instructions - I guess it'd take explicit load intrinsics, but even then
I'm not sure since the compiler appears to ignore our explicit
_mm512_store_epi32() and produce vmovaps anyway.  Luckily, this does not
matter on current hardware.)

Alexander

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ