
Date: Sun, 19 Jul 2015 23:50:50 +0300
From: Aleksey Cherepanov <lyosha@...nwall.com>
To: johndev@...ts.openwall.com
Subject: faster way to transpose matrix of words for sse and bits for bitslice
It seems that transposition is a common operation in john: sse needs
to transpose matrix of integers and bitslice needs to transpose a
matrix of bits.
I found an algo to transpose bit matrices faster than bit by bit in
Hacker's Delight. I saw a similar algo in other places, for instance
for integers with sse.
http://www.hackersdelight.org/hdcodetxt/transpose32.c.txt
http://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html
The algo
Let's enumerate parts of matrix that will be moved: at each step we
switch 2 and 3 leaving 1 and 4 intact.
The first move:
11112222 11113333
11112222 11113333
11112222 11113333
11112222 => 11113333
33334444 22224444
33334444 22224444
33334444 22224444
33334444 22224444
Move #2:
11221122 11331133
11221122 11331133
33443344 22442244
33443344 => 22442244
11221122 11331133
11221122 11331133
33443344 22442244
33443344 22442244
Move #3:
12121212 13131313
34343434 24242424
12121212 13131313
34343434 => 24242424
12121212 13131313
34343434 24242424
12121212 13131313
34343434 24242424
So we switch squares, then subsquares and so on. Or in reverse order.
Actually there are many other correct orders.
Hacker's Delight gives the following macro (I modified it a bit) to
switch parts of 2 rows.
#define swap(a0, a1, j, m) \
t = (a0 ^ (a1 >> j)) & m; \
a0 ^= t; \
a1 ^= (t << j);
So to switch squares in the first step above, we would need 4 swaps.
Hacker's Delight seems to target RISCs and reduce number of constants,
while my tests showed that variable for mask is not needed on x8664.
Also with explicit mask, it is possible to rearrange swaps to reduce
loads from memory.
A mail from a reader mentions problems with constants on RISCs:
http://hackersdelight.org/corres.txt
"Note that the magic multiplier needed for summing the 6bit
fields can be derived easily from one of the existing masks.
This is advantageous on processors with poor support for
constructing constants, such as ARM (the ARM can execute
t = t & (t << 2) in a single instruction, on the other hand)."
I wrote a generator to populate code to transpose 64x64 bit matrix.
Straight forward generator to produce code like in Hacker's Delight:
perl e 'sub t { my $i = shift; my $j = shift; my $m = shift; return unless $j > 0; for my $k ($i .. $i + $j  1) { printf "swap(bits[%d], bits[%d], %d, 0x%016xULL);\n", $k, $k + $j, $j, $m; } my $nj = $j >> 1; $m ^= $m << $nj; t($i, $nj, $m); t($i + $j, $nj, $m); } t(0, 32, 0x00000000ffffffff)'
With rearrangements:
perl e 'sub t { my $i = shift; my $j = shift; my $m = shift; return unless $j > 0; for my $k ($i .. $i + $j  1) { my $s = (($m >> 1) & ~$m) ? "" : "1"; printf "swap$s(bits[%d], bits[%d], %d, 0x%016xULL);\n", $k, $k + $j, $j, $m; } my $nj = $j >> 1; $m ^= $m << $nj; t($i, $nj, $m); t($i + $j, $nj, $m); } t(0, 32, 0x00000000ffffffff)'
Without rearrangements, it is useful to apply mask earlier for the
first round:
#define swap1(a0, a1, j, m) \
t = (a0 & m) ^ (a1 >> j); \
a0 ^= t; \
a1 ^= (t << j);
Code with rearrangements is 0.5% faster than regular code.
Fluctuations might occur though...
I think current descrypt and LM uses something similar. LM could
transpose the matrix back and remove FMT_BS flag because we expect
many hashes (right?) unlike with salted descrypt.
Also Hacker's Delight mentions that rotate instruction can reduce
swap() macro from 6 instructions to 4. We might want to do that on
XOP and AVX.
My test code is attached.
Thanks!

Regards,
Aleksey Cherepanov
View attachment "simple_bit_layout.c" of type "text/xcsrc" (12229 bytes)
Powered by blists  more mailing lists