Date: Mon, 18 Mar 2013 01:53:04 +0400 From: Solar Designer <solar@...nwall.com> To: crypt-dev@...ts.openwall.com Subject: Re: scrypt TMTO defeater Colin, Anthony, all - On Sun, Mar 17, 2013 at 08:03:21AM +0400, Solar Designer wrote: > So I've tried to implement an enhancement like what I described above: > writing not only to V_j, but also to V_i (with another value). > Unfortunately, it turns out that BlockMix's shuffling(*) gets in the way > of efficient implementations, and _possibly_ the inefficiency mostly hurts > cleaner CPU code, as opposed to hacks and ASICs. Specifically, when the > V element we write into is or might be the same as V_j, in a clean > implementation we have to postpone those writes until the BlockMix > completes, and this means re-reading the same values from memory > (hopefully, from L1 cache) rather than simply storing register values > into two locations at once (such as X and V_j). Essentially, we'd have a hard to optimize out blkcpy(), which is bad in that we're not doing any crypto on the CPU while it runs. I think we prefer to have a fine mix of crypto code and memory accesses, which makes greater use of CPU resources. > In order to do the > latter without introducing extra checks, I had to limit the writes to > elements that are definitely not same as V_j; I chose (~j & (N - 1)) for > their indices. Here's what I think is a better idea: instead of "NOT j (mod N)", use "j XOR 1" (assumes N >= 2). This way, with halved r (e.g., moving from r=8 to r=4 when enabling the TMTO defeater) we maintain the same TLB pressure (since the V element to update is then on the same memory page, if r was such that one V element fit in a page at all). As a bonus, this avoids the extra "& (N - 1)". Also, rather than merely overwrite V elements, we can use them first. I've implemented XOR'ing in of the adjacent V element into BlockMix output (in SMix's second loop only, indeed). Although I see no shortcut from not doing this, I think it's a good thing to do, in part because the CPU might be fetching some or all of this data into cache anyway when we initiate the writes. That's because our writes are currently of 16 bytes per instruction, whereas cache lines are currently 64 bytes. Before we've issued 4 such write instructions, the CPU might already be fetching the rest of the cache line in. I guess there's logic in the CPU to avoid or reduce such needless activity when the writes are performed with nearly adjacent instructions, but I am not sure if ours are always grouped together closely enough as we're mixing crypto code and memory accesses. (BTW, this may also affect the first loop in SMix, which fills in the array for the first time.) New revision is attached. Comments are welcome. New MD5 of output of "tests" with defeat_tmto = 1: 1720dceef20e9d63dd526b717aa6d947 Alexander View attachment "escrypt-defeat-tmto-24to26.diff" of type "text/plain" (7907 bytes) View attachment "escrypt-defeat-tmto-26.diff" of type "text/plain" (16148 bytes) Download attachment "escrypt-0.0.26.tar.gz" of type "application/x-gzip" (10458 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.