john-dev - Re: Parallella: scrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAepdCa0hiFwEO8znfgEXu+QD2rjE+sNVSqZbC=iVhaLd_u9Cw@mail.gmail.com>
Date: Thu, 4 Jul 2013 16:38:46 +0100
From: Rafael Waldo Delgado Doblas <lord.rafa@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: scrypt

Hello,

I will focus on Litecoin integration, I would like confirm this:

Cgminer divides the work between the different host connected to the mining
pool, then it will send keys to be hashed with scrypt and the hash would be
compared with the one that we want to break (normal brute-force attack).
Once the correct key has been found, it will be shared with the pool in
order to generate the Litecoins that would be shared between the all hosts.

We are going to use JtR to hash the keys that has been sent to us, compare
the hash with the ciphertext that we want to break and return if is valid
or not.

Also Cgminer can work in solo mode but I suppose that this not that much
interesting and in anyway for JtR communication only changes the key set
that we are going to hash.

Then in order to integrate Cgminer with JtR, I need to remove from Cgminer:
scrypt, sha256, fpga,gpu support and any other hash function (since we'll
hash from JtR) and only keep Cgminner's pooling functions.

Then it doesn't matter if Cgminer supports CPU mining or not because this
work will be accomplish by JtR core since we will only use Cgminer to
connect with the rest of the Litecoin network.

Cgminer also suports bitcoin then also bcrypt will be needed to a full
Cgminer but since Katja still working on it I will comment any bitcoin
reference for now.

Regards,
Rafael.



2013/7/4 Solar Designer <solar@...nwall.com>

> Hi Rafael,
>
> On Tue, Jul 02, 2013 at 10:58:49PM +0100, Rafael Waldo Delgado Doblas
> wrote:
> > I have checked the scrypt TMTO version that you send me. If I understand
> > the algorithm correctly, In order to reduce the memory used, we compute
> Xi
> > sequentially* in the first "for" but instead of store all Xi in V, we
> only
> > store those that makes i%exploit_tmto=0. Then in the second "for" we need
> > to calculate again Xj if j%exploit_tmto != 0, using the immediately
> smaller
> > Xb that makes b%exploit_tmto=0, as base to calculate it. Please tell me
> if
> > I make any mistake.
>
> Your description above looks correct, but I think it's better to put it
> in plain English:
>
> To exploit scrypt's time-memory tradeoff, we may store a subset of V
> elements in SMix's first loop, and recompute the missing elements from
> the preceding elements (that we did store) in the second loop.  Thus, we
> save memory by storing fewer V elements, but we pay for this with extra
> computation in SMix's second loop.
>
> Which elements of V we skip/recompute, if any, is up to us to choose.
> The sample implementation I sent you has that variable named
> exploit_tmto, where e.g. a value of 2 would result in skipping of every
> other element.  Some Litecoin miners I saw call the same thing "gap" -
> where e.g. a gap of 2 would achieve the same effect of skipping every
> other element.  (I think calling this "gap" is slightly misleading,
> because the actual gap between stored elements is smaller by one than
> this value.)
>
> BTW, if we wanted to reduce memory usage e.g. to 75% of original and not
> more (e.g. because we did not need to reduce it more, and thus didn't
> want to pay more than we have to in terms of computation time), we'd
> prefer to implement things slightly differently, where we'd sometimes
> store consecutive elements, but sometimes skip one.  I mention this to
> illustrate that the straightforward modulo division (or equivalent) is
> not the only way to do it (and not always the best).  Technically, we
> can skip/recompute arbitrary elements.
>
> > Also would like to know, why do you said to me that this approach is not
> > valid to be integrated into JtR?
>
> That's a misunderstanding.  What I guess you're referring to is that I
> said that we currently need scrypt on Epiphany only for Litecoin, rather
> than in general, because scrypt should hopefully be used (for defense)
> with high enough memory settings that running it on Epiphany (with
> high TMTO factors as we'd have to) would be painfully slow as compared
> to running it e.g. on typical x86 CPUs.  Hopefully, Litecoin's 128 KB is
> more of an exception than the rule.
>
> Additionally, with scrypt set to use megabytes of memory TMTO might not
> be beneficial on current CPUs.  Theoretically, it could be beneficial in
> some special cases due to reducing the working set size for more of it
> to fit in a smaller and faster cache, so if you have time for this
> experiment please feel free to play with it.  However, as you have seen
> my current scrypt format shows pretty good OpenMP scalability when using
> 16 MB (per thread), which, given that these systems have fewer memory
> channels than CPU cores, suggests that it is mostly not memory-bound.
> Maybe that will change once we implement interleaving to speedup the
> Salsa20/8 core computations, though.
>
> > Is it only because as you said before,
> > only a smaller portion of the allocated region is actually used
>
> Of course not.  That's a property of one hack only, and it's trivial to
> fix (by allocating just the right amount of memory) if we cared to.
> That hack was never meant for actual use; it was only meant for testing,
> such as to obtain actual figures for increase in computation with
> different TMTO factors when using an actual test vector.  For actual
> use, it'd need to be based other than on the reference implementation,
> it'd need to avoid modulo division (which is a somewhat costly
> operation), and it'd need to do the memory allocation right.  (Modulo
> division may be avoided by maintaining an extra index variable, which
> would be reset to 0 whenever it hits the TMTO factor value.)
>
> > I almost know what the functions of scrypt do, at least in a general
> idea,
> > but I still having doubts with the integerify(X, r), may someone explain
> it
> > a little bit better? because in "
> http://www.tarsnap.com/scrypt/scrypt.pdf"
> > I couldn't see too much about it.
>
> It produces an integer out of the current block.  What else do you want
> to know about it?
>
> > Furthermore the tmto is only useful for Epiphany because we have only 32K
> > and Litecoin needs 128K, is it right?.
>
> Yes.  If we had much more local memory, we wouldn't want to use TMTO.
>
> > Also the only way to use paralleling
> > computer is create multiple instances of scrypt algorithm computing
> > different keys*, is it true?.
>
> Almost.  Any different inputs would do (although for Litecoin in
> particular we're more limited), and theoretically scrypt has the "p"
> parameter, which is for parallelism.  In case we're computing (whether
> for attack or defense) scrypt with p > 1, we can exploit parallelism
> available at this level (and we might not need multiple instances of
> scrypt then).  However, in practice I think scrypt is currently most
> commonly used with p = 1 (and there are good reasons for this).
>
> > After this, what should be my next task? Litecoin integration with JtR
>
> Here are some tasks for you:
>
> 1. Litecoin mining integration into JtR, using code from cgminer.
> I think we should use code from latest cgminer that still had Litecoin
> mining on CPU, and we should support both CPU and GPU.
>
> 2. Litecoin mining on Epiphany.  I think there's little need to have
> this in JtR (although we may revisit this later), so maybe implement it
> as a patch for cgminer?  This will be mostly a proof-of-concept.  We
> don't expect to achieve good Litecoin mining performance on current
> Epiphany chips in absolute terms - at best, it'd be decent performance
> per Watt.
>
> 3. Litecoin mining on Xeon Phi.  You can't work on this yet since we
> don't have a Phi ready for use yet - but we expect to soon.  I think
> this may actually result in good Litecoin mining performance, unlike
> the above.
>
> There are also these potential tasks:
>
> 4. Implement TMTO for JtR scrypt format, test different TMTO factors
> with different scrypt settings (that is, for different hashes being
> cracked) and see if there are any combinations where TMTO makes things
> faster (due to caches).
>
> 5. Implement interleaving (of multiple scrypt instances per thread) for
> JtR scrypt format.  To understand interleaving, you may take a look at
> different revisions of this program I wrote:
>
> http://download.openwall.net/pub/projects/php_mt_seed/
> http://cvsweb.openwall.com/cgi/cvsweb.cgi/projects/php_mt_seed/
>
> The oldest version available there has 2x interleave and no SIMD; the
> latest has 4x interleave when built without SIMD, and 8x interleave (in
> addition to SIMD) when built with 4x SIMD (meaning 32 seeds being tested
> simultaneously, per thread - and many more per CPU chip).  To figure
> this out, you'd look at intermediate code revisions and diffs as well.
>
> In plain English: we interleave instructions from 2 or more instances of
> the cipher, etc. being cracked in order to provide more
> instruction-level parallelism, which may be needed to allow for and
> maximize multiple issue of instructions and to avoid data dependency
> stalls by hiding instruction latencies.  For example, with one instance
> we could computing:
>
> c = a + b;
> e = c + d;
>
> We have a data dependency here on "c", which prevents these two
> additions from being performed in parallel (by different ALUs, if two or
> more are present in our one CPU core).  Additionally, if the latency of
> one addition is e.g. 2 cycles (that is, the result is not available for
> use immediately, although the next instruction may issue immediately if
> it is unrelated), then there may be a stall (even on a single-issue CPU
> with just one ALU).  We may solve these problems by interleaving two or
> more instances:
>
> c1 = a1 + b1;
> c2 = a2 + b2;
> e1 = c1 + d1;
> e2 = c2 + d2;
>
> As you can see, there's enough parallelism for a dual-issue CPU core now
> (with two ALUs), as long as there are no high instruction latencies, as
> well as for a single-issue CPU core with 2-cycle latencies.  However,
> still not enough parallelism for a hypothetical dual-issue CPU core with
> 2-cycle latencies - we'd need 3x or 4x interleave there.
>
>
> Of the tasks above, if you work on tasks 1, 2, 3 I think we'd actually
> use your code.  If you work on tasks 4, 5, your code will be for
> testing/PoC, but I'm afraid I'd have to re-do it from scratch then (to
> meet some quality requirements that I can't communicate to you easily).
> Yet you may give these a try and it might be somewhat helpful.
>
> Whatever you choose to focus on, you need to increase your pace A LOT.
> You should aim to complete (get working code for) any of the five
> numbered tasks above in no more than 1 week per task - and preferably
> much less (try to do it in a day? I know it's tough, but just try!)
> I think basic (yet working) implementations of 1, 4, 5 may be completed
> in 1 day each.
>
> > or try to move scrypt to Epiphany?
>
> There's too little need to have the JtR scrypt format working on
> Epiphany.  Rather, we promised (to the Parallella community) to try
> Litecoin mining on Epiphany, so we should do that.  However, if it's
> easier for you to go via implementing scrypt proper on Epiphany first,
> feel free to do that and implement Litecoin mining on top of it.  This
> means that the "gap" won't be hard-coded, though, which may have a
> slight performance hit compared to the specialized implementation with a
> hard-coded "gap" - yet I am OK with either approach, and we'd be able to
> specialize the code later if you prefer.
>
> Thanks,
>
> Alexander
>

Content of type "text/html" skipped
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.