crypt-dev - scrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20110509194543.GA631@openwall.com>
Date: Mon, 9 May 2011 23:45:43 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: scrypt

Yuri, all -

I've just experimented with the scrypt file encryption program a little
bit to see how the theory from the slides and the paper matches the
actual code.  I downloaded scrypt-1.1.6.tgz, compiled, and ran it to
encrypt a file.  (I used a without-SSE2 build for my experiments.
Apparently, SSE2 may be enabled with "./configure --enable-sse2", but I
did not play with this yet.)

I also added -DDEBUG to CFLAGS in the Makefile after it's been
generated by configure (or I could have specified the CFLAGS via the env
var when running configure, but I did not).  This enables some debugging
output - such as the scrypt KDF parameters that the program ends up
using, which it chooses based on a CPU speed benchmark in order to have
the key derivation take approx. 5 seconds by default.

The most relevant files are:

crypto_scrypt-nosse.c
crypto_scrypt-ref.c
crypto_scrypt-sse.c

under lib/crypto.  Only one of these gets used; in my case, it was
crypto_scrypt-nosse.c.  The -ref (reference implementation) is not
referenced from anywhere else at all (as far as I could tell from a
quick grep); it differs from the -nosse one by not having some loops
unrolled (so it is slightly easier to understand).

Almost all processing time is spent in this loop in crypto_scrypt():

	for (i = 0; i < p; i++) {
		/* 3: B_i <-- MF(B_i, N) */
		smix(&B[i * 128 * r], r, N, V, XY);
	}

There are three parameters that affect the running time of this loop:
p, r, and N.  I ran this on an old Pentium 3 at 1 GHz (yes, this is a
reason why I did not enable SSE2), and got the following values: p=3,
r=8, N=32768.  The last two of these correspond to 32 MB of memory used.

To see these values, I added:

#include <stdio.h>
...
	fprintf(stderr, "p=%u N=%llu r=%u\n", p, N, r);

(IIRC, I did not see the value of r with a mere -DDEBUG.)

"p" is supposed to allow for parallelism.  To check for this, I tried
re-arranging the loop such that it would try the indices in reverse
order.  I changed:

	uint32_t i;

to:

	int i;

and:

	for (i = 0; i < p; i++) {

to:

	for (i = p - 1; i >= 0; i--) {

The program still worked fine, decrypting a previously encrypted file
(that was encrypted by the unpatched version).  I think this confirms
that things would also work fine if this loop's iterations were executed
in parallel.  (Of course, p=3 is rather low for this, but higher values
may be used.)

The Salsa20/8 core only has enough parallelism in it to make use of
128-bit SIMD vectors, such as on SSE2.  This is easily seen in
salsa20_8() in crypto_scrypt-sse.c.

On future CPUs with wider vectors and on GPUs, it could be difficult to
bring more parallelism down to the SIMD level, perhaps all the way from
the high-level loop mentioned above.  This might or might not be
practical.

Thus, scrypt is possibly limited in its ability to fully use future
CPUs and GPUs with wider than 128-bit SIMD vectors.  This may need
further research and testing of potential implementations (that try to
bring more parallelism to this level).

On the other hand, if we choose to implement scrypt in FPGA (perhaps
using RAM on the same board with the FPGA chip) and not worry about the
efficiency of our hashing method on CPUs/GPUs, that is fine... but in
that case we could pick something deliberately unfriendly to CPUs/GPUs
and use it as the proper component under the "alternative approach" I
described in here earlier today.  And leave scrypt for the CPU.

A primary advantage of scrypt is its sequential memory-hard property,
which means that ASIC implementations are relatively expensive (compared
to ASIC implementations of other KDFs).  However, for our project it
means that if we choose to implement multiple components as I described
in the "alternative approach" posting, we might prefer not to waste our
FPGA space on scrypt, which is CPU-friendly (for up to 128-bit vectors).

As usual, comments are welcome.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.