Date: Thu, 4 Jul 2013 14:28:05 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Parallella: scrypt Hi Rafael, On Tue, Jul 02, 2013 at 10:58:49PM +0100, Rafael Waldo Delgado Doblas wrote: > I have checked the scrypt TMTO version that you send me. If I understand > the algorithm correctly, In order to reduce the memory used, we compute Xi > sequentially* in the first "for" but instead of store all Xi in V, we only > store those that makes i%exploit_tmto=0. Then in the second "for" we need > to calculate again Xj if j%exploit_tmto != 0, using the immediately smaller > Xb that makes b%exploit_tmto=0, as base to calculate it. Please tell me if > I make any mistake. Your description above looks correct, but I think it's better to put it in plain English: To exploit scrypt's time-memory tradeoff, we may store a subset of V elements in SMix's first loop, and recompute the missing elements from the preceding elements (that we did store) in the second loop. Thus, we save memory by storing fewer V elements, but we pay for this with extra computation in SMix's second loop. Which elements of V we skip/recompute, if any, is up to us to choose. The sample implementation I sent you has that variable named exploit_tmto, where e.g. a value of 2 would result in skipping of every other element. Some Litecoin miners I saw call the same thing "gap" - where e.g. a gap of 2 would achieve the same effect of skipping every other element. (I think calling this "gap" is slightly misleading, because the actual gap between stored elements is smaller by one than this value.) BTW, if we wanted to reduce memory usage e.g. to 75% of original and not more (e.g. because we did not need to reduce it more, and thus didn't want to pay more than we have to in terms of computation time), we'd prefer to implement things slightly differently, where we'd sometimes store consecutive elements, but sometimes skip one. I mention this to illustrate that the straightforward modulo division (or equivalent) is not the only way to do it (and not always the best). Technically, we can skip/recompute arbitrary elements. > Also would like to know, why do you said to me that this approach is not > valid to be integrated into JtR? That's a misunderstanding. What I guess you're referring to is that I said that we currently need scrypt on Epiphany only for Litecoin, rather than in general, because scrypt should hopefully be used (for defense) with high enough memory settings that running it on Epiphany (with high TMTO factors as we'd have to) would be painfully slow as compared to running it e.g. on typical x86 CPUs. Hopefully, Litecoin's 128 KB is more of an exception than the rule. Additionally, with scrypt set to use megabytes of memory TMTO might not be beneficial on current CPUs. Theoretically, it could be beneficial in some special cases due to reducing the working set size for more of it to fit in a smaller and faster cache, so if you have time for this experiment please feel free to play with it. However, as you have seen my current scrypt format shows pretty good OpenMP scalability when using 16 MB (per thread), which, given that these systems have fewer memory channels than CPU cores, suggests that it is mostly not memory-bound. Maybe that will change once we implement interleaving to speedup the Salsa20/8 core computations, though. > Is it only because as you said before, > only a smaller portion of the allocated region is actually used Of course not. That's a property of one hack only, and it's trivial to fix (by allocating just the right amount of memory) if we cared to. That hack was never meant for actual use; it was only meant for testing, such as to obtain actual figures for increase in computation with different TMTO factors when using an actual test vector. For actual use, it'd need to be based other than on the reference implementation, it'd need to avoid modulo division (which is a somewhat costly operation), and it'd need to do the memory allocation right. (Modulo division may be avoided by maintaining an extra index variable, which would be reset to 0 whenever it hits the TMTO factor value.) > I almost know what the functions of scrypt do, at least in a general idea, > but I still having doubts with the integerify(X, r), may someone explain it > a little bit better? because in "http://www.tarsnap.com/scrypt/scrypt.pdf" > I couldn't see too much about it. It produces an integer out of the current block. What else do you want to know about it? > Furthermore the tmto is only useful for Epiphany because we have only 32K > and Litecoin needs 128K, is it right?. Yes. If we had much more local memory, we wouldn't want to use TMTO. > Also the only way to use paralleling > computer is create multiple instances of scrypt algorithm computing > different keys*, is it true?. Almost. Any different inputs would do (although for Litecoin in particular we're more limited), and theoretically scrypt has the "p" parameter, which is for parallelism. In case we're computing (whether for attack or defense) scrypt with p > 1, we can exploit parallelism available at this level (and we might not need multiple instances of scrypt then). However, in practice I think scrypt is currently most commonly used with p = 1 (and there are good reasons for this). > After this, what should be my next task? Litecoin integration with JtR Here are some tasks for you: 1. Litecoin mining integration into JtR, using code from cgminer. I think we should use code from latest cgminer that still had Litecoin mining on CPU, and we should support both CPU and GPU. 2. Litecoin mining on Epiphany. I think there's little need to have this in JtR (although we may revisit this later), so maybe implement it as a patch for cgminer? This will be mostly a proof-of-concept. We don't expect to achieve good Litecoin mining performance on current Epiphany chips in absolute terms - at best, it'd be decent performance per Watt. 3. Litecoin mining on Xeon Phi. You can't work on this yet since we don't have a Phi ready for use yet - but we expect to soon. I think this may actually result in good Litecoin mining performance, unlike the above. There are also these potential tasks: 4. Implement TMTO for JtR scrypt format, test different TMTO factors with different scrypt settings (that is, for different hashes being cracked) and see if there are any combinations where TMTO makes things faster (due to caches). 5. Implement interleaving (of multiple scrypt instances per thread) for JtR scrypt format. To understand interleaving, you may take a look at different revisions of this program I wrote: http://download.openwall.net/pub/projects/php_mt_seed/ http://cvsweb.openwall.com/cgi/cvsweb.cgi/projects/php_mt_seed/ The oldest version available there has 2x interleave and no SIMD; the latest has 4x interleave when built without SIMD, and 8x interleave (in addition to SIMD) when built with 4x SIMD (meaning 32 seeds being tested simultaneously, per thread - and many more per CPU chip). To figure this out, you'd look at intermediate code revisions and diffs as well. In plain English: we interleave instructions from 2 or more instances of the cipher, etc. being cracked in order to provide more instruction-level parallelism, which may be needed to allow for and maximize multiple issue of instructions and to avoid data dependency stalls by hiding instruction latencies. For example, with one instance we could computing: c = a + b; e = c + d; We have a data dependency here on "c", which prevents these two additions from being performed in parallel (by different ALUs, if two or more are present in our one CPU core). Additionally, if the latency of one addition is e.g. 2 cycles (that is, the result is not available for use immediately, although the next instruction may issue immediately if it is unrelated), then there may be a stall (even on a single-issue CPU with just one ALU). We may solve these problems by interleaving two or more instances: c1 = a1 + b1; c2 = a2 + b2; e1 = c1 + d1; e2 = c2 + d2; As you can see, there's enough parallelism for a dual-issue CPU core now (with two ALUs), as long as there are no high instruction latencies, as well as for a single-issue CPU core with 2-cycle latencies. However, still not enough parallelism for a hypothetical dual-issue CPU core with 2-cycle latencies - we'd need 3x or 4x interleave there. Of the tasks above, if you work on tasks 1, 2, 3 I think we'd actually use your code. If you work on tasks 4, 5, your code will be for testing/PoC, but I'm afraid I'd have to re-do it from scratch then (to meet some quality requirements that I can't communicate to you easily). Yet you may give these a try and it might be somewhat helpful. Whatever you choose to focus on, you need to increase your pace A LOT. You should aim to complete (get working code for) any of the five numbered tasks above in no more than 1 week per task - and preferably much less (try to do it in a day? I know it's tough, but just try!) I think basic (yet working) implementations of 1, 4, 5 may be completed in 1 day each. > or try to move scrypt to Epiphany? There's too little need to have the JtR scrypt format working on Epiphany. Rather, we promised (to the Parallella community) to try Litecoin mining on Epiphany, so we should do that. However, if it's easier for you to go via implementing scrypt proper on Epiphany first, feel free to do that and implement Litecoin mining on top of it. This means that the "gap" won't be hard-coded, though, which may have a slight performance hit compared to the specialized implementation with a hard-coded "gap" - yet I am OK with either approach, and we'd be able to specialize the code later if you prefer. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.