Date: Fri, 29 May 2015 01:22:10 -0400
From: Alain Espinosa <alainesp@...ta.cu>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice SHA-256

-------- Original message --------
From: Solar Designer <solar@...nwall.com>
Date: 05/29/2015 12:32 AM (GMT-05:00)
To: john-dev@...ts.openwall.com
Cc: Alain Espinosa <alainesp@...ta.cu>
Subject: Re: [john-dev] bitslice SHA-256

> ...This is 3 ADDs in a row. We don't have to treat them as arbitrary 3 ADDs. Instead, we can treat them as one 4-input ADD. I expect that once we've optimized it, the total instruction count for 4-input ADD will be lower than 15.

I considered this, but did not find a way to reduce the instruction count.

> ...On top of this, if we didn't already have a bitselect exposed in the original expression (like we do here), some of the logical operations from our implementations of the ADDs might be merged with nearby logical operations from the original expression producing more complicated yet single-instruction logical operations such as bitselect or AVX-512/Maxwell ternary logic operations.

Yes, see my last email, where I consider Neon and AVX512.

> ...I briefly experimented with merged ADDs in this md5slice.c revision

I will take a look.

> ...add32c() is a 3-input ADD where one of the inputs is a constant

I checked this code while looking for a way to reduce the instruction count of the sums. If I understand it correctly, you use more than 5 instructions for one add (more than 10 for 2; if I recall correctly, you use 11).

> ...Finally, much of the estimated 27% increase in instruction count comes from extra loads. But extra loads are not necessarily extra instructions. On x86, we have load-op instructions (where one of the inputs is in memory), and the SIMD ones typically execute in a single cycle (when the data is in L1 cache) - just as fast as instructions with both inputs in registers. This works as long as at most 2 out of 3 instructions on a given clock cycle are load-ops, since we only have 2 read ports from L1 data cache. I think 2 out of 3 is good enough for our needs here.

Agreed, but Neon, for example, does not have load-op instructions, and I was trying to make a more general comparison.

> ...Alain, I think you might be the person who could pull this off, given the additional advice and encouragement above. :-)

I am not sure about this. I can test it on Neon, but not on AVX512, which I think is more important. I will check your link, but as I said, I tried to reduce the instruction count of the sums (considering that they are chained) and did not find an obvious (to me) way to do it.

Regards,
Alain
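
For reference, the counts discussed above (about 5 logical operations per bit for one ADD, so about 15 for three chained ADDs) can be reproduced with a minimal sketch like the one below. This is not the code from md5slice.c or add32c(); the names lane_t, full_add, bs_add, bs_add4 and bs_add4_matches_scalar are hypothetical, and 64 parallel instances in a uint64_t are assumed only to keep the example portable. It shows the unoptimized baseline that a merged 4-input ADD would have to beat, not the optimization itself.

```c
#include <stdint.h>

#define BITS 32            /* width of the words being added */
typedef uint64_t lane_t;   /* one bit position across 64 parallel instances */

/* Bitsliced full adder: 5 logical operations (2 XOR, 2 AND, 1 OR).
 * With a bitselect instruction the carry could be a single operation:
 * carry = t ? cin : a. */
static void full_add(lane_t a, lane_t b, lane_t cin, lane_t *sum, lane_t *cout)
{
    lane_t t = a ^ b;
    *sum  = t ^ cin;
    *cout = (a & b) | (t & cin);
}

/* r = a + b (mod 2^32), ripple carry from bit 0 upwards. */
static void bs_add(lane_t r[BITS], const lane_t a[BITS], const lane_t b[BITS])
{
    lane_t carry = 0;
    for (int i = 0; i < BITS; i++)
        full_add(a[i], b[i], carry, &r[i], &carry);
}

/* Baseline 4-input ADD: three independent 2-input ADDs in a row,
 * i.e. roughly 3 * 5 = 15 operations per bit position. */
static void bs_add4(lane_t r[BITS], const lane_t a[BITS], const lane_t b[BITS],
                    const lane_t c[BITS], const lane_t d[BITS])
{
    bs_add(r, a, b);
    bs_add(r, r, c);
    bs_add(r, r, d);
}

/* Transpose 64 ordinary 32-bit values into bitslice form. */
static void to_slices(lane_t s[BITS], const uint32_t v[64])
{
    for (int i = 0; i < BITS; i++) {
        s[i] = 0;
        for (int k = 0; k < 64; k++)
            s[i] |= (lane_t)((v[k] >> i) & 1u) << k;
    }
}

/* Extract instance k from bitslice form. */
static uint32_t from_slices(const lane_t s[BITS], int k)
{
    uint32_t v = 0;
    for (int i = 0; i < BITS; i++)
        v |= (uint32_t)((s[i] >> k) & 1u) << i;
    return v;
}

/* Returns 1 if the bitsliced result equals ordinary 32-bit addition
 * for all 64 instances, 0 otherwise. */
int bs_add4_matches_scalar(void)
{
    uint32_t a[64], b[64], c[64], d[64], x = 0x9E3779B9u;
    lane_t sa[BITS], sb[BITS], sc[BITS], sd[BITS], r[BITS];

    for (int k = 0; k < 64; k++) {   /* simple LCG for test inputs */
        a[k] = x; x = x * 1664525u + 1013904223u;
        b[k] = x; x = x * 1664525u + 1013904223u;
        c[k] = x; x = x * 1664525u + 1013904223u;
        d[k] = x; x = x * 1664525u + 1013904223u;
    }
    to_slices(sa, a); to_slices(sb, b); to_slices(sc, c); to_slices(sd, d);
    bs_add4(r, sa, sb, sc, sd);

    for (int k = 0; k < 64; k++)
        if (from_slices(r, k) != (uint32_t)(a[k] + b[k] + c[k] + d[k]))
            return 0;
    return 1;
}
```

The exact count is slightly below 5 per bit because bit 0 has no carry-in and the carry out of bit 31 is discarded. A merged 4-input ADD would replace the three bs_add() calls with one routine that shares intermediate terms between them.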