Date: Sun, 31 May 2015 11:19:57 +0200 From: magnum <john.magnum@...hmail.com> To: john-dev@...ts.openwall.com Subject: Re: Interleaving of intrinsics On 2015-05-30 04:55, Solar Designer wrote: > On Fri, May 29, 2015 at 10:52:41PM +0200, magnum wrote: >> Here's a GitHub issue where we discuss interleaving and present some >> benchmarks: >> https://github.com/magnumripper/JohnTheRipper/issues/1217 > magnum, what exactly did you benchmark for these tables? - > > https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106695709 > > I fail to make sense of them, except that they seem to show interleaving > not helping. It's unclear what hash type this is - raw or something > else. Are the 8x, 14x, and 6x the numbers of OpenMP threads? If so, > why 14x? Was max_keys_per_crypt kept constant or was it increasing along > with increased interleaving (if so, we might have been exceeding L1 data > cache size for higher interleaving factors)? It's pbkdf2-hmac-sha2 (using testparas.sh now in tree but subject to change). The 14x was on a 16xHT machine which wasn't idling, it had a pretty static 100% load on one core. I know 14x isn't a perfect test but it should be consistent. The sha256 format has OMP_SCALE_4, perhaps I should re-test with 1. > https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106839338 > > These are reasonable results for pbkdf2-hmac-sha512, but there's > something "wrong" for pbkdf2-hmac-sha256. It is suspicious whenever > there's a performance regression for going from x1 to x2 interleaving, > but then a performance improvement at higher interleaving factors. This > suggest there's some overhead incurred with interleaving, and it is > probably avoidable. Perhaps the para loops doesn't always unroll in sha256 and we end up with actual loops, as discussed below? The overhead would be less significant for higher paras. >> I think you will have some educated thoughts about this; Here's part of >> our current SHA-1: >> (...) >> This file is -O3 (from a pragma) so I guess both cases will be unrolled >> but there is obviously a big difference after just unrolling. Assuming a >> perfect optimizer it wouldn't matter but assuming a non-perfect one, is >> the former better? I'm guessing SHA-1 was written that way for a reason? > > I wasn't involved in writing either. I think JimF was first to > introduce the *_PARA_DO stuff to JtR. FWIW I'm pretty sure Simon did that the first ones. > To me, both of these depend on compiler optimization too much. When I > interleave stuff, I use non-array temporary variables and explicit > unrolling via macros - see BF_std.c, MD5_std.c, and php_mt_seed. I have > little experience with how well or not the approach with arrays and > loops works across different compilers/versions - I only benchmarked > these formats in JtR just like others in here did. > <lots snipped> Thanks, lots of food for thought. Lei, Are you with us here? > magnum wrote: >> One difference is that the former is almost guaranteed to be fully >> unrolled why the latter might not. > > I am not so sure. I think these might be at similar risk of not being > unrolled. But the SHA1 version basically puts a loop around single instructions. I'd be angry with gcc if it doesn't unroll that at -O3. But anyway we should try to establish how it ends up and why. magnum
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.