Date: Wed, 8 Aug 2012 10:27:06 +0400
From: Solar Designer <solar@...nwall.com>
To: musl@...ts.openwall.com
Subject: Re: crypt* files in crypt directory

On Wed, Aug 08, 2012 at 01:28:44AM -0400, Rich Felker wrote:
> On Wed, Aug 08, 2012 at 08:42:35AM +0400, Solar Designer wrote:
> > I see that you did this - and I think you took it too far.  The code
> > became twice slower on Pentium 3 when compiling with gcc 3.4.5 (approx.
> > 140 c/s down to 77 c/s).  Adding -finline-functions
> > -fold-unroll-all-loops regains only a fraction of the speed (112 c/s);
> > less aggressive loop unrolling results in lower speeds.
> 
> Can you compare with a more modern gcc?

I could and I might do that later, but to me the slowdown with gcc 3 is
enough reason not to make those changes in that specific way.

> > The impact on x86-64 is less.  With Ubuntu 12.04's gcc 4.6.3 on FX-8120
> > I get 490 c/s for the original code, 450 c/s for your code without
> > inlining/unrolling, and somehow only 430 c/s with -finline-functions
> > -funroll-loops.
> 
> Actually this is a lot closer to what I expected. I think you'll find
> similar results on 32-bit with gcc 4.6.3 too. The modern expectation
> is that manually unrolling loops will give worse performance than
> letting the compiler decide what to do. Certainly there are exceptions
> to the expected result, but on average, it's the right decision.

Per the numbers above, the compiler's unrolling here is slower not only
than the manual unrolling, but also than the non-unrolled code.

> Even if it's twice as slow, that should only be the cost of
> incrementing the (logarithmic) iteration count by one.

Yes, and I think this is significant.
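
To put that trade-off in numbers, here is a rough sketch.  It assumes
only that bcrypt's work grows as 2^cost; the helper names and the
"units per iteration" figure are made up for illustration and are not
taken from crypt_blowfish.c.

```c
/* Illustration only: helper names are hypothetical, not from crypt_blowfish.c. */

/* bcrypt runs 2^cost iterations of its expensive key setup. */
static unsigned long bf_iterations(unsigned cost)
{
    return 1UL << cost;
}

/* Total work in arbitrary units, given a relative per-iteration time. */
static unsigned long bf_total_work(unsigned cost,
                                   unsigned long units_per_iteration)
{
    return bf_iterations(cost) * units_per_iteration;
}
```

With these, bf_total_work(8, 2) equals bf_total_work(9, 1): code that is
twice as slow at $2a$08 takes as long as the faster code at $2a$09, so
the faster code buys one extra cost increment for the same time budget.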

> The size difference between the versions is roughly 50%

It doesn't have to be.  There are 6 instances of BF_ENCRYPT in
BF_crypt().  I am only asking you to revert the two that are inside
BF_body to their larger form.  The other 4 may remain as calls to a
function.  Alternatively, all 6 may be function calls, but then the
function's BF_ENCRYPT should be a fully manually unrolled one.  I am not
sure which of these options will be faster overall for typical settings
(we'd need to benchmark these at $2a$08).

> (7k vs 11.5k with -Os
> and roughly 9k vs 13.5k with -O3). Yes one can argue that the
> difference doesn't matter for one particular component they especially
> care about,

Exactly.

> but everyone cares about something different, and in the
> end the whole library ends up 50% larger if you follow that to its
> logical end.

Makes sense.

> I'd much rather stick with letting the compiler do the
> bloating-up for performance purposes if the user wants it, so that
> the choice is left to them.

Maybe you could support -DFAST_CRYPT or the like.  It could enable
forced inlining and manual unrolls in crypt_blowfish.c.
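
A minimal sketch of how such a switch might look, using a toy 4-round
Feistel-style cipher as a stand-in (this is not Blowfish and not the
actual crypt_blowfish.c code).  FAST_CRYPT is the macro name suggested
above; TOY_INLINE, toy_round() and toy_encrypt() are hypothetical, and
the always_inline attribute assumes gcc or a compatible compiler.

```c
#include <stdint.h>

#define TOY_ROUNDS 4

/* With -DFAST_CRYPT, force inlining (gcc-specific attribute);
 * otherwise leave the decision to the compiler. */
#ifdef FAST_CRYPT
#define TOY_INLINE static inline __attribute__((always_inline))
#else
#define TOY_INLINE static
#endif

/* One toy round: mix the left half with a round key, then swap halves. */
TOY_INLINE void toy_round(uint32_t *l, uint32_t *r, uint32_t k)
{
    uint32_t t = *l;
    *l = *r ^ (((t << 3) | (t >> 29)) + k);
    *r = t;
}

static void toy_encrypt(uint32_t *l, uint32_t *r,
                        const uint32_t key[TOY_ROUNDS])
{
#ifdef FAST_CRYPT
    /* Manually unrolled: larger code, no loop overhead. */
    toy_round(l, r, key[0]);
    toy_round(l, r, key[1]);
    toy_round(l, r, key[2]);
    toy_round(l, r, key[3]);
#else
    /* Compact form: the compiler may still unroll this if asked to. */
    int i;
    for (i = 0; i < TOY_ROUNDS; i++)
        toy_round(l, r, key[i]);
#endif
}
```

Both builds compute the same result, so the default stays small and
those who care about speed opt in at compile time.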

Alexander
