musl - Re: crypt_blowfish integration, optimization

Message-ID: <20120809223258.GW27715@brightrain.aerifal.cx>
Date: Thu, 9 Aug 2012 18:32:59 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: crypt_blowfish integration, optimization

On Fri, Aug 10, 2012 at 02:21:03AM +0400, Solar Designer wrote:
> On Thu, Aug 09, 2012 at 05:46:54PM -0400, Rich Felker wrote:
> > I've taken this version and made some minimum changes based on my
> > version, mainly for integration with musl where I'm testing it. I also
> > think we've reached the final word on loop unrolling:
> > 
> > Just For Fun, I tried replacing your unrolled BF_ROUND loop with a for
> > loop and compiling with -O3 on gcc 4.6.3. After noticing the
> > performance numbers were coming out near-identical, and that the .o
> > sizes were mysteriously identical, I decided, Just For Fun, to
> > disassemble both versions with objdump and diff them. They are
> > identical. That is, modern gcc generates byte-for-byte identical code
> > with -O3 for the manually unrolled loop and the for loop.
> 
> What about -O2?
> 
> -O3 is probably not what will be used for most musl builds, is it?
> 
> Hmm, for me "gcc -Q -O2 --help=optimizers" and ditto for -O3 both show
> "disabled" for -funroll-loops.  Why was the loop unrolled for you?

Not sure. I've found -Q --help=optimizers completely unreliable in the
past though. It only reports minimal differences between -Os, -O2, and
-O3, and trying to start with -O3 and reproduce -Os by just changing
the options that are different does not give effects even remotely
similar to -Os.

> Did you also have -funroll-loops specified explicitly?  If so, does this
> happen for normal musl builds?  I guess not?

No, I did not explicitly specify it. At present, -Os is default for
static libc and -O3 is default for shared libc. The reason for this
discrepancy is that -fPIC generates a lot of size and speed bloat at
each function call, so the inlining from -O3 comes at reduced cost (it
eliminates wasteful prologue, compensating for some of the size
increase) and much greater performance benefits (again, from killing
prologue).

I've been thinking of making -O3 default across the board rather than
having different defaults for the two, which are ugly from a
build-system perspective, but some people are still against it even
though it's easy to override.

> As discussed, the problem with avoiding such hand-unrolls is that the
> compiler doesn't know just which loops are most important to unroll.

My experience has been that it tends to make good decisions overall,
and that if somebody is using -Os, they really want smallest size, not
performance.
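
[Editorial sketch, not part of the original message.] The rolled versus
hand-unrolled comparison discussed above can be tried on a toy example.
The code below is NOT crypt_blowfish: toy_f, TOY_ROUND, the single
256-entry table and the toy_* function names are all hypothetical
stand-ins for a Blowfish-style 16-round Feistel loop. It only shows that
the two source forms compute the same thing; whether a given compiler
emits identical machine code for them is something to check, not
something this sketch claims.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a Blowfish-style F function: one 256-entry table
 * instead of the four real S-boxes. Hypothetical, for illustration. */
static uint32_t toy_f(uint32_t x, const uint32_t s[256])
{
    return ((s[x >> 24] + s[(x >> 16) & 0xff]) ^ s[(x >> 8) & 0xff])
           + s[x & 0xff];
}

/* One Feistel round: mix subkey N into L, then fold F(L) into R. */
#define TOY_ROUND(L, R, N) \
    do { (L) ^= p[(N)]; (R) ^= toy_f((L), s); } while (0)

/* Sixteen rounds written out by hand. */
void toy_encrypt_unrolled(uint32_t io[2], const uint32_t p[18],
                          const uint32_t s[256])
{
    uint32_t L = io[0], R = io[1];
    TOY_ROUND(L, R, 0);  TOY_ROUND(R, L, 1);
    TOY_ROUND(L, R, 2);  TOY_ROUND(R, L, 3);
    TOY_ROUND(L, R, 4);  TOY_ROUND(R, L, 5);
    TOY_ROUND(L, R, 6);  TOY_ROUND(R, L, 7);
    TOY_ROUND(L, R, 8);  TOY_ROUND(R, L, 9);
    TOY_ROUND(L, R, 10); TOY_ROUND(R, L, 11);
    TOY_ROUND(L, R, 12); TOY_ROUND(R, L, 13);
    TOY_ROUND(L, R, 14); TOY_ROUND(R, L, 15);
    io[0] = R ^ p[17];
    io[1] = L ^ p[16];
}

/* The same sixteen rounds as a for loop, two rounds per iteration. */
void toy_encrypt_loop(uint32_t io[2], const uint32_t p[18],
                      const uint32_t s[256])
{
    uint32_t L = io[0], R = io[1];
    for (int i = 0; i < 16; i += 2) {
        TOY_ROUND(L, R, i);
        TOY_ROUND(R, L, i + 1);
    }
    io[0] = R ^ p[17];
    io[1] = L ^ p[16];
}

/* Fill the tables from a small LCG (arbitrary test data, not a real
 * key schedule) and check both versions agree on 1000 inputs.
 * Returns the number of inputs checked. */
int toy_selfcheck(void)
{
    uint32_t p[18], s[256], seed = 1;
    int checked = 0;

    for (int i = 0; i < 18; i++)
        p[i] = seed = seed * 1664525u + 1013904223u;
    for (int i = 0; i < 256; i++)
        s[i] = seed = seed * 1664525u + 1013904223u;

    for (uint32_t n = 0; n < 1000; n++) {
        uint32_t a[2] = { n, ~n }, b[2] = { n, ~n };
        toy_encrypt_unrolled(a, p, s);
        toy_encrypt_loop(b, p, s);
        assert(a[0] == b[0] && a[1] == b[1]);
        checked++;
    }
    return checked;
}
```

To repeat the kind of comparison described earlier in the thread, one
could compile a file containing only one of the two encrypt functions at
a time with "gcc -O3 -c" (and again with -O2 or -Os), disassemble each
object with "objdump -d", and diff the listings.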

> BTW, what speeds are you getting on your Atom?

I was clocking 0.573 seconds for one run with the 2^12 iterations on
one test, and about 4 million cycles per run with 2^4 iterations. This
is with my version of the code (essentially the same as yours;
compiled at -O3).

> How does this compare to
> the original crypt_blowfish-1.2 with asm code (both on 32-bit)?

I'll have to get the code and try it... The asm doesn't seem to have
ever been present in the code sent to the list.

Rich