[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date: Sun, 19 Jul 2015 23:50:50 +0300
From: Aleksey Cherepanov <lyosha@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: faster way to transpose matrix of words for sse and bits for bitslice

It seems that transposition is a common operation in john: sse needs
to transpose matrix of integers and bitslice needs to transpose a
matrix of bits.

I found an algo to transpose bit matrices faster than bit by bit in
Hacker's Delight. I saw a similar algo in other places, for instance
for integers with sse.

http://www.hackersdelight.org/hdcodetxt/transpose32.c.txt

http://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html

The algo

Let's enumerate parts of matrix that will be moved: at each step we
switch 2 and 3 leaving 1 and 4 intact.

The first move:

11112222     11113333
11112222     11113333
11112222     11113333
11112222  => 11113333
33334444     22224444
33334444     22224444
33334444     22224444
33334444     22224444

Move #2:

11221122     11331133
11221122     11331133
33443344     22442244
33443344  => 22442244
11221122     11331133
11221122     11331133
33443344     22442244
33443344     22442244

Move #3:

12121212     13131313
34343434     24242424
12121212     13131313
34343434  => 24242424
12121212     13131313
34343434     24242424
12121212     13131313
34343434     24242424

So we switch squares, then subsquares and so on. Or in reverse order.
Actually there are many other correct orders.

Hacker's Delight gives the following macro (I modified it a bit) to
switch parts of 2 rows.

#define swap(a0, a1, j, m) \
t = (a0 ^ (a1 >> j)) & m; \
a0 ^= t; \
a1 ^= (t << j);

So to switch squares in the first step above, we would need 4 swaps.

Hacker's Delight seems to target RISCs and reduce number of constants,
while my tests showed that variable for mask is not needed on x86-64.
Also with explicit mask, it is possible to rearrange swaps to reduce

A mail from a reader mentions problems with constants on RISCs:
http://hackersdelight.org/corres.txt
"Note that the magic multiplier needed for summing the 6-bit
fields can be derived easily from one of the existing masks.
This is advantageous on processors with poor support for
constructing constants, such as ARM (the ARM can execute
t = t & (t << 2) in a single instruction, on the other hand)."

I wrote a generator to populate code to transpose 64x64 bit matrix.

Straight forward generator to produce code like in Hacker's Delight:

perl -e 'sub t { my \$i = shift; my \$j = shift; my \$m = shift; return unless \$j > 0; for my \$k (\$i .. \$i + \$j - 1) { printf "swap(bits[%d], bits[%d], %d, 0x%016xULL);\n", \$k, \$k + \$j, \$j, \$m; } my \$nj = \$j >> 1; \$m ^= \$m << \$nj; t(\$i, \$nj, \$m); t(\$i + \$j, \$nj, \$m); } t(0, 32, 0x00000000ffffffff)'

With rearrangements:

perl -e 'sub t { my \$i = shift; my \$j = shift; my \$m = shift; return unless \$j > 0; for my \$k (\$i .. \$i + \$j - 1) { my \$s = ((\$m >> 1) & ~\$m) ? "" : "1"; printf "swap\$s(bits[%d], bits[%d], %d, 0x%016xULL);\n", \$k, \$k + \$j, \$j, \$m; } my \$nj = \$j >> 1; \$m ^= \$m << \$nj; t(\$i, \$nj, \$m); t(\$i + \$j, \$nj, \$m); } t(0, 32, 0x00000000ffffffff)'

Without rearrangements, it is useful to apply mask earlier for the
first round:

#define swap1(a0, a1, j, m) \
t = (a0 & m) ^ (a1 >> j); \
a0 ^= t; \
a1 ^= (t << j);

Code with rearrangements is 0.5% faster than regular code.
Fluctuations might occur though...

I think current descrypt and LM uses something similar. LM could
transpose the matrix back and remove FMT_BS flag because we expect
many hashes (right?) unlike with salted descrypt.

Also Hacker's Delight mentions that rotate instruction can reduce
swap() macro from 6 instructions to 4. We might want to do that on
XOP and AVX.

My test code is attached.

Thanks!

--
Regards,
Aleksey Cherepanov

View attachment "simple_bit_layout.c" of type "text/x-csrc" (12229 bytes)