Date: Tue, 5 May 2015 20:05:50 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka,

Going forward, please mention specific branches and commits in your messages without me having to ask, so that I can take a look at your code sooner. I've just skimmed over d3db03708e2a3b30e9ac3954b42082b6ac6e87c3 in the interleaving branch at:

https://github.com/Lucife-r/JohnTheRipper

I notice that you use a 2x interleaving factor with SSE2, but 4x with AVX2. Why this specific choice? If you were trying to match the SIMD vector width, that's flawed logic. On the contrary, with wider SIMD vectors a smaller interleaving factor might work better if the cache sizes stay the same. I suggest that you try 2x interleaving with AVX2. There's little point in going for 4x without having seen a speedup with 2x first.

Then, you're relying on compiler optimizations too much. You've turned random_number, index_local, and index_global from variables into arrays, and you're hoping(?) that the compiler will allocate groups of 2 or 4 registers for them anyway. Well, it might or it might not. Please use simple variables, like the original code did: random_number0, random_number1, etc. You're using explicit indices anyway, so this won't complicate your code.

Please omit or rework index_global_t. There's no point in precomputing values like "index_global_t[3]=index_global[3]+3;" when you then only use them as "S[index_global_t[3]]". If S[] elements were no larger than 8 bytes each, the CPU's addressing modes would allow e.g. S+i*8+24 to be calculated during effective address calculation in the load instruction at no extra cost. This doesn't work for elements larger than 8 bytes (and yours are __m256i), so it makes sense to precompute the indices multiplied by sizeof(__m256i), and then access the data via a macro that does the proper typecasts to use byte offsets.
Do this not only for index_global_t, but also for i0 and index_local*, so that the multiplication by sizeof(__m256i) (a left shift by 5) is performed less frequently, with +32, +64, +96, etc. then added to the precomputed offset.

On Sat, May 02, 2015 at 06:14:05AM +0200, Agnieszka Bielec wrote:
> I made interleaving for no-SIMD, SSE2 and AVX2 version, the speed for
> costs 2,2 and 0,0 is slightly better but for costs 6,6 and 8,8 is
> worse, so I'm not sure if I did everything correctly.

Given POMELO's use of memory, interleaving might in fact be of little help: the more memory you use at once, the slower memory accesses become as you get further out of cache. I think this is why you're not seeing a speedup with only your initial implementation, not optimized yet. You might or might not see more of a speedup when you implement optimizations such as those I suggested above.

Also, please try 2x rather than 4x interleaving for AVX2. With 4x, we might be increasing our working set size unnecessarily, and the register pressure might be too high. Remember that on x86_64 we have only 16 SIMD registers plus 16 scalar registers. If our code needs more, it starts spilling registers to memory (well, usually to L1 cache).

I suggest that you review the generated assembly code without and with interleaving. See if extra instructions get generated (such as spilling registers to memory and loading them back). Also, find the left shifts that are used to calculate byte offsets from indices, and see if any can be avoided or moved to outer loops. Perhaps some of these optimizations can also be made to the non-interleaved code (and even submitted back to the author of POMELO).

> Maybe it's because we have bigger gaps between chunks of data in memory

No, I think the memory layout is fine. When different cache lines are accessed, it does not matter how large or small the gap between their currently cached addresses is.
However, I suggest that you align the memory allocations on a cache line boundary. Right now, you align them to 32 bytes as AVX2 requires, but our cache lines are 64 bytes. Crossing a cache line boundary unnecessarily has a performance cost, and it thrashes other valuable data out of cache (it thrashes two cache lines instead of just one).

Oh, and in the SSE2/AVX code you're not aligning the memory allocation of S at all, so you only get the current malloc()'s guaranteed 16-byte alignment, which might or might not happen to also be 64-byte aligned. You should explicitly make it at least 64-byte aligned. Preferably, align S on a page boundary (4096 bytes). And do this for the non-interleaved implementations too, for fair benchmarks against them.

> I uploaded my code on the branch "interleaving"

Oh, I missed this at first. :-(

> I'm including results:
>
> well SSE2 interleaving (I forgot to change algorithm_name to "SSE2")

As magnum correctly suggested, this should be automatic. Also, it should be reported as "AVX" when #ifdef __AVX__, because in that case the compiler generates AVX instructions for the same SSE2 intrinsics.

Thanks,

Alexander