crypt-dev - Re: scrypt TMTO defeater

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130317215304.GA10559@openwall.com>
Date: Mon, 18 Mar 2013 01:53:04 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: scrypt TMTO defeater

Colin, Anthony, all -

On Sun, Mar 17, 2013 at 08:03:21AM +0400, Solar Designer wrote:
> So I've tried to implement an enhancement like what I described above:
> writing not only to V_j, but also to V_i (with another value).
> Unfortunately, it turns out that BlockMix's shuffling(*) gets in the way
> of efficient implementations, and _possibly_ the inefficiency mostly hurts
> cleaner CPU code, as opposed to hacks and ASICs.  Specifically, when the
> V element we write into is or might be the same as V_j, in a clean
> implementation we have to postpone those writes until the BlockMix
> completes, and this means re-reading the same values from memory
> (hopefully, from L1 cache) rather than simply storing register values
> into two locations at once (such as X and V_j).

Essentially, we'd have a hard to optimize out blkcpy(), which is bad in
that we're not doing any crypto on the CPU while it runs.  I think we
prefer to have a fine mix of crypto code and memory accesses, which makes
greater use of CPU resources.

> In order to do the
> latter without introducing extra checks, I had to limit the writes to
> elements that are definitely not same as V_j; I chose (~j & (N - 1)) for
> their indices.

Here's what I think is a better idea: instead of "NOT j (mod N)", use
"j XOR 1" (assumes N >= 2).  This way, with halved r (e.g., moving from
r=8 to r=4 when enabling the TMTO defeater) we maintain the same TLB
pressure (since the V element to update is then on the same memory page,
if r was such that one V element fit in a page at all).  As a bonus,
this avoids the extra "& (N - 1)".

Also, rather than merely overwrite V elements, we can use them first.
I've implemented XOR'ing in of the adjacent V element into BlockMix
output (in SMix's second loop only, indeed).  Although I see no shortcut
from not doing this, I think it's a good thing to do, in part because
the CPU might be fetching some or all of this data into cache anyway
when we initiate the writes.  That's because our writes are currently
of 16 bytes per instruction, whereas cache lines are currently 64 bytes.
Before we've issued 4 such write instructions, the CPU might already be
fetching the rest of the cache line in.  I guess there's logic in the
CPU to avoid or reduce such needless activity when the writes are
performed with nearly adjacent instructions, but I am not sure if ours
are always grouped together closely enough as we're mixing crypto code
and memory accesses.  (BTW, this may also affect the first loop in SMix,
which fills in the array for the first time.)

New revision is attached.  Comments are welcome.

New MD5 of output of "tests" with defeat_tmto = 1:
1720dceef20e9d63dd526b717aa6d947

Alexander

View attachment "escrypt-defeat-tmto-24to26.diff" of type "text/plain" (7907 bytes)

View attachment "escrypt-defeat-tmto-26.diff" of type "text/plain" (16148 bytes)

Download attachment "escrypt-0.0.26.tar.gz" of type "application/x-gzip" (10458 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.