musl - Re: crypt* files in crypt directory

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120809033613.GA24926@openwall.com>
Date: Thu, 9 Aug 2012 07:36:13 +0400
From: Solar Designer <solar@...nwall.com>
To: musl@...ts.openwall.com
Subject: Re: crypt* files in crypt directory

On Wed, Aug 08, 2012 at 05:48:55PM -0400, Rich Felker wrote:
> On Wed, Aug 08, 2012 at 09:03:00AM +0200, Daniel Cegie??ka wrote:
> > > Maybe you could support -DFAST_CRYPT or the like.  It could enable
> > > forced inlining and manual unrolls in crypt_blowfish.c.
> > >
> > > Alexander
> > 
> > This can be a very sensible solution.
> 
> Unless there's a really compelling reason to do so, I'd like to avoid
> having multiple alternative versions of the same code in a codebase.
> It makes it so there's more combinations you have to test to be sure
> the code works and doesn't have regressions.

This makes sense.

> As it stands, the code I posted with the manual unrolling removed
> performs _better_ than the manually unrolled code with gcc 4 on x86_64
> when optimized for speed,

For me, it does not.  The opposite is true.

> and it's 33% smaller when optimized for size.

I think we can arrive at almost the same size reduction without such
loss in speed.

BTW, I did not mention it yet, but the asm code that you dropped
probably performed a lot faster on Atom.  (This is based on user reports
for similar code in JtR.)  Anyone with an Atom to test?

> As for being slower on gcc 3, there's already much more
> performance-critical code that's significantly slower on musl+gcc3
> than on glibc due to gcc3 badness, for example all of the
> endianness-swapping functions (byteswap.h and htonl,etc. in netdb.h).

Not everyone will agree that those things you mention are much more
performance-critical.

> Really the only place where crypt performance is critical is in JtR,

As you point out below, it is also relevant when doing authentication.

> and there you're using your own optimized code internal to JtR, right?

Yes.  In fact, JtR's code is even faster since it processes multiple
candidate passwords at once (per logical CPU) for greater
instruction-level parallelism - something we can't reasonably do here.

> Even if crypt is half-speed on gcc3 without the manual unrolling, that
> still only makes a 1-order-of-magnitude (base 2) difference to the
> iterations you can use while keeping the same responsiveness/load,
> i.e. not nearly enough to make or break somebody's ability to crack
> your hashes. (In general, as long as you don't try to iterate this
> principle, an attacker who can afford N time (or N cores) can also
> afford 2*N time (or 2*N cores).)

Yes, but that's not the only relevant threat model.  It is also common
(or even more common) for the attacker to apply a fixed amount of
resources (whatever they happen to have and/or are willing to use) and
get a certain percentage of passwords cracked.  This percentage is
(slightly) affected by the cost of computing each password hash.  For a
(small) subset of the users, this will be the difference between their
password having been cracked vs. not.

Another way to think of it is that an average password's entropy may be,
say, 40 bits with some not-too-strict password policy in place (30 bits
without).  A 1-bit difference in the amount of password stretching we
provide is thus similar to a 2.5% difference in password entropy (or
more without password policy).  Passwords would need to be accordingly
more complicated to compensate for that, or they'd be weaker by this much.

For, say, 1 million users (e.g. total across multiple systems with
musl), that's 1 megabit of human memory - and I think that's a lot.
We can't directly compare this to the code size increase on those systems
(e.g., a few kilobytes per system or even per crypt-using program with
static linking), but clearly human memory is a far more precious
resource than computer memory.

> Aside from my own feelings on the matter, I'm trying to consider the
> impressions it makes on our user base. I've already taken some heat
> for replacing the heap sort qsort code in musl with smoothsort,
> despite it being a lot faster in a task where performance is generally
> important, and the size difference was less than this crypt unrolling.

Oh, I don't recall that.  Was it on this list?

> When someone frustrated with bloat sees hand-unrolled loops, their
> first reaction is "eew, this code is bloated".

I am frustrated with bloat, but hand-unrolled loops in crypto code, in
memcpy(), etc. look just right to me.

There's no compiler option to ask it to unroll only the truly
performance-critical loops.

> My intent with
> modernizing (and fixing the stack usage) of the old DES crypt code was
> to save enough space that we could get the new algorithms (blowfish,
> md5, sha) integrated without much (or even any, if possible) size
> increase versus the old bad DES code. I think this makes a difference
> for "selling" the idea of supporting all these algorithms to the
> anti-bloat faction of musl's following.

Understood.

How about we (almost) achieve your size goal for crypt_blowfish (7 KB
including tables), yet keep the hand-unrolling?  I think this is
possible, albeit maybe with some tricks.

Is use of GNU C extensions (also supported in clang) acceptable?  I am
thinking of a "Labels as Values" trick to share more code.  This would
make the source code less readable, though.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.