Date: Sat, 30 May 2015 23:06:05 -0400
From: Alain Espinosa <alainesp@...ta.cu>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice SHA-256



-------- Original message --------
From: Alain Espinosa <alainesp@...ta.cu> 
Date:05/29/2015 2:13 PM (GMT-05:00) 
To: john-dev@...ts.openwall.com 
Cc: 
Subject: Re: [john-dev] bitslice SHA-256 

...I began the hand-crafted assembler bitslice version, but was interrupted. I will probably finish it tonight, and then we will have a fair real-world performance comparison (not a theoretical one).

Just done. Attached is a new version of the code. It includes hand-crafted Microsoft MASM assembly code for bitslice AVX2 SHA256. The benchmark configuration is Windows 8.1, Visual Studio 2013, Core i5-4670 3.4GHz, one thread only. Performance is given in millions of keys per second:

- 23.7 : Normal   SHA256 implemented with hand-crafted AVX2 assembly
- 19.5 : Bitslice SHA256 implemented with hand-crafted AVX2 assembly (22% slower than normal measured as time per key, 56% faster than intrinsics)
- 12.5 : Bitslice SHA256 implemented with AVX2 intrinsics  

To give an idea of how this may translate into real-world cracking speed, Hash Suite 3.3 cracks Raw-SHA256 at 24.8M. The assembled object file is 23KB, which is near the limit of the L1 cache. I use 2 main functions ("bs_sha256_step_avx2" with 72% of execution time and "bs_sha256_RW_avx2" with 25%), similar to the intrinsics code. Unrolling all calls grows the object file beyond 256KB, and performance drops.

Theoretical performance we can expect on other architectures (please see "bs_sha256.c" for details):

- (79-62)/62 = 27%  slower than normal SHA256 in x86_64   (5 instructions to sum)
- (58-55)/55 = 5.5% faster than normal SHA256 in Neon/XOP (3 instructions to sum)
- (53-40)/40 = 33%  faster than normal SHA256 in AVX512   (2 instructions to sum)
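
The "instructions to sum" figures above count the logic operations needed for one bit position of a bitsliced 32-bit addition. As a minimal sketch (not the code from "bs_sha256.c"), assume a hypothetical layout where each uint64_t holds one bit position of 64 independent candidates; the AVX2 code would use 256-bit registers instead. A ripple-carry add built only from AND/OR/XOR then needs 5 operations per bit position. The names plane_t, bs_add32, bs_pack and bs_unpack_lane are illustrative only.

```c
#include <stdint.h>

#define WORD_BITS 32   /* SHA-256 works on 32-bit words */
#define LANES     64   /* candidates processed in parallel (assumption) */

/* One bit-plane: bit l of planes[i] is bit i of the word in lane l. */
typedef uint64_t plane_t;

/* sum = a + b (mod 2^32) in every lane, using only AND/OR/XOR. */
static void bs_add32(plane_t sum[WORD_BITS],
                     const plane_t a[WORD_BITS],
                     const plane_t b[WORD_BITS])
{
    plane_t carry = a[0] & b[0];   /* bit 0 has no carry-in: half adder */
    sum[0] = a[0] ^ b[0];

    for (int i = 1; i < WORD_BITS; i++) {
        plane_t axb = a[i] ^ b[i];                        /* op 1 */
        plane_t s   = axb ^ carry;                        /* op 2: sum bit */
        carry       = (a[i] & b[i]) | (axb & carry);      /* ops 3-5: carry out */
        sum[i]      = s;
    }
    /* The carry out of bit 31 is discarded (addition is mod 2^32). */
}

/* Transpose LANES ordinary 32-bit values into WORD_BITS bit-planes. */
static void bs_pack(plane_t planes[WORD_BITS], const uint32_t vals[LANES])
{
    for (int i = 0; i < WORD_BITS; i++) {
        planes[i] = 0;
        for (int l = 0; l < LANES; l++)
            planes[i] |= (plane_t)((vals[l] >> i) & 1u) << l;
    }
}

/* Extract the 32-bit value held in one lane. */
static uint32_t bs_unpack_lane(const plane_t planes[WORD_BITS], int lane)
{
    uint32_t v = 0;
    for (int i = 0; i < WORD_BITS; i++)
        v |= (uint32_t)((planes[i] >> lane) & 1u) << i;
    return v;
}
```

Packing two arrays of 64 values with bs_pack, calling bs_add32, and reading each lane back with bs_unpack_lane should give the same results as 64 ordinary 32-bit additions. Architectures with a bit-select or three-input logic instruction can presumably compute the carry (or the whole full adder) in fewer instructions, which would be consistent with the 3- and 2-instruction figures in the list above.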

Search for "jae zero_const; ONES" in the asm file for an interesting way to reduce instructions when adding a constant. I brute-forced the calculation of the carry (with a program that tests all possibilities), trying to reduce it from 3 instructions (hoping ANDN would give some advantage). No luck. I will try to find some other way to optimize the adds and reduce the instruction count. The instruction scheduling can be improved, but I think AVX2 is not the best instruction set for bitslice with adds.
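
For readers without the asm file at hand, here is a minimal sketch of the general idea behind cheaper constant adds. It is not the code from the attachment and may differ from what the "zero_const" / "ONES" paths actually do. When one operand is a constant shared by all candidates, each of its bit-planes is all zeros or all ones, so the full adder collapses to 2 or 3 operations per bit. The layout (64 lanes per uint64_t) and the names plane_t and bs_add_const32 are assumptions for illustration.

```c
#include <stdint.h>

#define WORD_BITS 32
typedef uint64_t plane_t;   /* hypothetical: one bit position of 64 candidates */

/* x += k (mod 2^32) in every lane, where k is the same constant for all lanes. */
static void bs_add_const32(plane_t x[WORD_BITS], uint32_t k)
{
    plane_t carry = 0;

    for (int i = 0; i < WORD_BITS; i++) {
        plane_t t = x[i] ^ carry;
        /* The branch depends only on the constant, never on the data, so in
           fully unrolled code it can be resolved when the code is generated. */
        if ((k >> i) & 1u) {
            carry = x[i] | carry;   /* majority(x, 1, c) = x | c */
            x[i]  = ~t;             /* x ^ 1 ^ c */
        } else {
            carry = x[i] & carry;   /* majority(x, 0, c) = x & c */
            x[i]  = t;              /* x ^ 0 ^ c */
        }
    }
}
```

For example, bs_add_const32(x, 0x428A2F98u) adds the first SHA-256 round constant to every lane. Bit positions where the constant is 0 cost two operations (XOR, AND); positions where it is 1 cost three (XOR, NOT, OR), against five for a general add.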

Regards, 
Alain

[ CONTENT OF TYPE application/zip SKIPPED ]
