Date: Thu, 9 Aug 2012 18:32:59 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: crypt_blowfish integration, optimization

On Fri, Aug 10, 2012 at 02:21:03AM +0400, Solar Designer wrote:
> On Thu, Aug 09, 2012 at 05:46:54PM -0400, Rich Felker wrote:
> > I've taken this version and made some minimum changes based on my
> > version, mainly for integration with musl where I'm testing it. I also
> > think we've reached the final word on loop unrolling:
> >
> > Just For Fun, I tried replacing your unrolled BF_ROUND loop with a for
> > loop and compiling with -O3 on gcc 4.6.3. After noticing the
> > performance numbers were coming out near-identical, and that the .o
> > sizes were mysteriously identical, I decided, Just For Fun, to
> > disassemble both versions with objdump and diff them. They are
> > identical. That is, modern gcc generates byte-for-byte identical code
> > with -O3 for the manually unrolled loop and the for loop.
>
> What about -O2?
>
> -O3 is probably not what will be used for most musl builds, is it?
>
> Hmm, for me "gcc -Q -O2 --help=optimizers" and ditto for -O3 both show
> "disabled" for -funroll-loops. Why was the loop unrolled for you?

Not sure. I've found -Q --help=optimizers completely unreliable in the
past though. It only reports minimal differences between -Os, -O2, and
-O3, and trying to start with -O3 and reproduce -Os by just changing
the options that are different does not give effects even remotely
similar to -Os.

> Did you also have -funroll-loops specified explicitly? If so, does this
> happen for normal musl builds? I guess not?

No, I did not explicitly specify it. At present, -Os is default for
static libc and -O3 is default for shared libc.
The reason for this discrepancy is that -fPIC generates a lot of size
and speed bloat at each function call, so the inlining from -O3 comes
at reduced cost (it eliminates wasteful prologue, compensating for
some of the size increase) and much greater performance benefits
(again, from killing prologue). I've been thinking of making -O3
default across the board rather than having different defaults for the
two, which are ugly from a build-system perspective, but some people
are still against it even though it's easy to override.

> As discussed, the problem with avoiding such hand-unrolls is that the
> compiler doesn't know just which loops are most important to unroll.

My experience has been that it tends to make good decisions overall,
and that if somebody is using -Os, they really want smallest size, not
performance.

> BTW, what speeds are you getting on your Atom?

I was clocking 0.573 seconds for one run with the 2^12 iterations on
one test, and about 4 million cycles per run with 2^4 iterations. This
is with my version of the code (essentially the same as yours;
compiled at -O3).

> How does this compare to
> the original crypt_blowfish-1.2 with asm code (both on 32-bit)?

I'll have to get the code and try it... The asm doesn't seem to have
ever been present in the code sent to the list.

Rich
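
For anyone wanting to repeat the unrolled-versus-loop comparison
described above, here is a minimal sketch. It assumes a toy Feistel-style
round: TOY_ROUND, mix_unrolled, mix_looped and check_toy_rounds are
hypothetical names made up for illustration, and the round function is
not the real BF_ROUND from crypt_blowfish. It only shows the shape of
the two variants being compared.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy round, NOT the real BF_ROUND: XORs a rotated,
 * keyed copy of one half into the other half. */
#define TOY_ROUNDS 16
#define TOY_ROUND(L, R, K) \
    do { (R) ^= ((((L) << 5) | ((L) >> 27)) + (K)); } while (0)

/* Variant 1: the 16 rounds written out by hand. */
uint32_t mix_unrolled(uint32_t l, uint32_t r, const uint32_t key[TOY_ROUNDS])
{
    TOY_ROUND(l, r, key[0]);  TOY_ROUND(r, l, key[1]);
    TOY_ROUND(l, r, key[2]);  TOY_ROUND(r, l, key[3]);
    TOY_ROUND(l, r, key[4]);  TOY_ROUND(r, l, key[5]);
    TOY_ROUND(l, r, key[6]);  TOY_ROUND(r, l, key[7]);
    TOY_ROUND(l, r, key[8]);  TOY_ROUND(r, l, key[9]);
    TOY_ROUND(l, r, key[10]); TOY_ROUND(r, l, key[11]);
    TOY_ROUND(l, r, key[12]); TOY_ROUND(r, l, key[13]);
    TOY_ROUND(l, r, key[14]); TOY_ROUND(r, l, key[15]);
    return l ^ r;
}

/* Variant 2: the same 16 rounds as a plain for loop, two per pass. */
uint32_t mix_looped(uint32_t l, uint32_t r, const uint32_t key[TOY_ROUNDS])
{
    for (int i = 0; i < TOY_ROUNDS; i += 2) {
        TOY_ROUND(l, r, key[i]);
        TOY_ROUND(r, l, key[i + 1]);
    }
    return l ^ r;
}

/* Sanity check: both variants must compute the same value. */
void check_toy_rounds(void)
{
    uint32_t key[TOY_ROUNDS];
    for (int i = 0; i < TOY_ROUNDS; i++)
        key[i] = 0x9e3779b9u * (uint32_t)(i + 1);
    assert(mix_unrolled(0x01234567u, 0x89abcdefu, key) ==
           mix_looped(0x01234567u, 0x89abcdefu, key));
}
```

To compare generated code rather than results, one could put each
variant in its own file under the same function name, compile both with
the same flags (for example -O3, then -Os), and diff the "objdump -d"
output of the two .o files. Whether the two come out identical will
depend on the compiler version and optimization level.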