john-dev - Re: Parallella: scrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130704102805.GA10834@openwall.com>
Date: Thu, 4 Jul 2013 14:28:05 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: scrypt

Hi Rafael,

On Tue, Jul 02, 2013 at 10:58:49PM +0100, Rafael Waldo Delgado Doblas wrote:
> I have checked the scrypt TMTO version that you send me. If I understand
> the algorithm correctly, In order to reduce the memory used, we compute Xi
> sequentially* in the first "for" but instead of store all Xi in V, we only
> store those that makes i%exploit_tmto=0. Then in the second "for" we need
> to calculate again Xj if j%exploit_tmto != 0, using the immediately smaller
> Xb that makes b%exploit_tmto=0, as base to calculate it. Please tell me if
> I make any mistake.

Your description above looks correct, but I think it's better to put it
in plain English:

To exploit scrypt's time-memory tradeoff, we may store a subset of V
elements in SMix's first loop, and recompute the missing elements from
the preceding elements (that we did store) in the second loop.  Thus, we
save memory by storing fewer V elements, but we pay for this with extra
computation in SMix's second loop.

Which elements of V we skip/recompute, if any, is up to us to choose.
The sample implementation I sent you has that variable named
exploit_tmto, where e.g. a value of 2 would result in skipping of every
other element.  Some Litecoin miners I saw call the same thing "gap" -
where e.g. a gap of 2 would achieve the same effect of skipping every
other element.  (I think calling this "gap" is slightly misleading,
because the actual gap between stored elements is smaller by one than
this value.)

BTW, if we wanted to reduce memory usage e.g. to 75% of original and not
more (e.g. because we did not need to reduce it more, and thus didn't
want to pay more than we have to in terms of computation time), we'd
prefer to implement things slightly differently, where we'd sometimes
store consecutive elements, but sometimes skip one.  I mention this to
illustrate that the straightforward modulo division (or equivalent) is
not the only way to do it (and not always the best).  Technically, we
can skip/recompute arbitrary elements.

> Also would like to know, why do you said to me that this approach is not
> valid to be integrated into JtR?

That's a misunderstanding.  What I guess you're referring to is that I
said that we currently need scrypt on Epiphany only for Litecoin, rather
than in general, because scrypt should hopefully be used (for defense)
with high enough memory settings that running it on Epiphany (with
high TMTO factors as we'd have to) would be painfully slow as compared
to running it e.g. on typical x86 CPUs.  Hopefully, Litecoin's 128 KB is
more of an exception than the rule.

Additionally, with scrypt set to use megabytes of memory TMTO might not
be beneficial on current CPUs.  Theoretically, it could be beneficial in
some special cases due to reducing the working set size for more of it
to fit in a smaller and faster cache, so if you have time for this
experiment please feel free to play with it.  However, as you have seen
my current scrypt format shows pretty good OpenMP scalability when using
16 MB (per thread), which, given that these systems have fewer memory
channels than CPU cores, suggests that it is mostly not memory-bound.
Maybe that will change once we implement interleaving to speedup the
Salsa20/8 core computations, though.

> Is it only because as you said before,
> only a smaller portion of the allocated region is actually used

Of course not.  That's a property of one hack only, and it's trivial to
fix (by allocating just the right amount of memory) if we cared to.
That hack was never meant for actual use; it was only meant for testing,
such as to obtain actual figures for increase in computation with
different TMTO factors when using an actual test vector.  For actual
use, it'd need to be based other than on the reference implementation,
it'd need to avoid modulo division (which is a somewhat costly
operation), and it'd need to do the memory allocation right.  (Modulo
division may be avoided by maintaining an extra index variable, which
would be reset to 0 whenever it hits the TMTO factor value.)

> I almost know what the functions of scrypt do, at least in a general idea,
> but I still having doubts with the integerify(X, r), may someone explain it
> a little bit better? because in "http://www.tarsnap.com/scrypt/scrypt.pdf"
> I couldn't see too much about it.

It produces an integer out of the current block.  What else do you want
to know about it?

> Furthermore the tmto is only useful for Epiphany because we have only 32K
> and Litecoin needs 128K, is it right?.

Yes.  If we had much more local memory, we wouldn't want to use TMTO.

> Also the only way to use paralleling
> computer is create multiple instances of scrypt algorithm computing
> different keys*, is it true?.

Almost.  Any different inputs would do (although for Litecoin in
particular we're more limited), and theoretically scrypt has the "p"
parameter, which is for parallelism.  In case we're computing (whether
for attack or defense) scrypt with p > 1, we can exploit parallelism
available at this level (and we might not need multiple instances of
scrypt then).  However, in practice I think scrypt is currently most
commonly used with p = 1 (and there are good reasons for this).

> After this, what should be my next task? Litecoin integration with JtR

Here are some tasks for you:

1. Litecoin mining integration into JtR, using code from cgminer.
I think we should use code from latest cgminer that still had Litecoin
mining on CPU, and we should support both CPU and GPU.

2. Litecoin mining on Epiphany.  I think there's little need to have
this in JtR (although we may revisit this later), so maybe implement it
as a patch for cgminer?  This will be mostly a proof-of-concept.  We
don't expect to achieve good Litecoin mining performance on current
Epiphany chips in absolute terms - at best, it'd be decent performance
per Watt.

3. Litecoin mining on Xeon Phi.  You can't work on this yet since we
don't have a Phi ready for use yet - but we expect to soon.  I think
this may actually result in good Litecoin mining performance, unlike
the above.

There are also these potential tasks:

4. Implement TMTO for JtR scrypt format, test different TMTO factors
with different scrypt settings (that is, for different hashes being
cracked) and see if there are any combinations where TMTO makes things
faster (due to caches).

5. Implement interleaving (of multiple scrypt instances per thread) for
JtR scrypt format.  To understand interleaving, you may take a look at
different revisions of this program I wrote:

http://download.openwall.net/pub/projects/php_mt_seed/
http://cvsweb.openwall.com/cgi/cvsweb.cgi/projects/php_mt_seed/

The oldest version available there has 2x interleave and no SIMD; the
latest has 4x interleave when built without SIMD, and 8x interleave (in
addition to SIMD) when built with 4x SIMD (meaning 32 seeds being tested
simultaneously, per thread - and many more per CPU chip).  To figure
this out, you'd look at intermediate code revisions and diffs as well.

In plain English: we interleave instructions from 2 or more instances of
the cipher, etc. being cracked in order to provide more
instruction-level parallelism, which may be needed to allow for and
maximize multiple issue of instructions and to avoid data dependency
stalls by hiding instruction latencies.  For example, with one instance
we could computing:

c = a + b;
e = c + d;

We have a data dependency here on "c", which prevents these two
additions from being performed in parallel (by different ALUs, if two or
more are present in our one CPU core).  Additionally, if the latency of
one addition is e.g. 2 cycles (that is, the result is not available for
use immediately, although the next instruction may issue immediately if
it is unrelated), then there may be a stall (even on a single-issue CPU
with just one ALU).  We may solve these problems by interleaving two or
more instances:

c1 = a1 + b1;
c2 = a2 + b2;
e1 = c1 + d1;
e2 = c2 + d2;

As you can see, there's enough parallelism for a dual-issue CPU core now
(with two ALUs), as long as there are no high instruction latencies, as
well as for a single-issue CPU core with 2-cycle latencies.  However,
still not enough parallelism for a hypothetical dual-issue CPU core with
2-cycle latencies - we'd need 3x or 4x interleave there.


Of the tasks above, if you work on tasks 1, 2, 3 I think we'd actually
use your code.  If you work on tasks 4, 5, your code will be for
testing/PoC, but I'm afraid I'd have to re-do it from scratch then (to
meet some quality requirements that I can't communicate to you easily).
Yet you may give these a try and it might be somewhat helpful.

Whatever you choose to focus on, you need to increase your pace A LOT.
You should aim to complete (get working code for) any of the five
numbered tasks above in no more than 1 week per task - and preferably
much less (try to do it in a day? I know it's tough, but just try!)
I think basic (yet working) implementations of 1, 4, 5 may be completed
in 1 day each.

> or try to move scrypt to Epiphany?

There's too little need to have the JtR scrypt format working on
Epiphany.  Rather, we promised (to the Parallella community) to try
Litecoin mining on Epiphany, so we should do that.  However, if it's
easier for you to go via implementing scrypt proper on Epiphany first,
feel free to do that and implement Litecoin mining on top of it.  This
means that the "gap" won't be hard-coded, though, which may have a
slight performance hit compared to the specialized implementation with a
hard-coded "gap" - yet I am OK with either approach, and we'd be able to
specialize the code later if you prefer.

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.