Date: Fri, 29 May 2015 01:22:10 -0400
From: Alain Espinosa <alainesp@...ta.cu>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice SHA-256



-------- Original message --------
From: Solar Designer <solar@...nwall.com> 
Date:05/29/2015 12:32 AM (GMT-05:00) 
To: john-dev@...ts.openwall.com 
Cc: Alain Espinosa <alainesp@...ta.cu> 
Subject: Re: [john-dev] bitslice SHA-256 

...This is 3 ADDs in a row.  We don't have to treat them as arbitrary 3
ADDs.  Instead, we can treat them as one 4-input ADD.  I expect that
once we've optimized it, the total instruction count for 4-input ADD
will be lower than 15.

I considered this, but have not found a way to reduce the instruction count.
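For reference, a minimal sketch of the baseline under discussion: three chained bitslice ADDs, each built from a textbook 5-operation full adder per bit (hence the 15). The names `bs_t`, `full_add`, `bs_add`, `bs_add4`, `bs_set` and `bs_get` are hypothetical and not taken from any John the Ripper source; a 64-bit integer stands in for a SIMD register. It only shows what a merged 4-input ADD would have to beat and does not attempt the merge itself.

```c
#include <assert.h>
#include <stdint.h>

#define BITS 32            /* width of each word being added */
typedef uint64_t bs_t;     /* one bit plane: 64 independent lanes */

/* Full adder on bit planes: 5 logical operations (2 XOR, 2 AND, 1 OR). */
static void full_add(bs_t a, bs_t b, bs_t c, bs_t *sum, bs_t *carry)
{
    bs_t t = a ^ b;
    *sum = t ^ c;
    *carry = (t & c) | (a & b);
}

/* r = a + b (mod 2^32) in every lane, rippling the carry up from bit 0.
 * Bit 0 (no carry in) and bit 31 (no carry out) could be cheaper;
 * that is left out to keep the sketch short. */
static void bs_add(bs_t r[BITS], const bs_t a[BITS], const bs_t b[BITS])
{
    bs_t carry = 0;
    for (int i = 0; i < BITS; i++)
        full_add(a[i], b[i], carry, &r[i], &carry);
}

/* Baseline 4-input ADD: three independent 2-input ADDs in a row. */
static void bs_add4(bs_t r[BITS], const bs_t a[BITS], const bs_t b[BITS],
                    const bs_t c[BITS], const bs_t d[BITS])
{
    bs_add(r, a, b);
    bs_add(r, r, c);
    bs_add(r, r, d);
}

/* Store the 32-bit value v into one lane of the bit planes. */
static void bs_set(bs_t x[BITS], int lane, uint32_t v)
{
    for (int i = 0; i < BITS; i++) {
        x[i] &= ~((bs_t)1 << lane);
        x[i] |= (bs_t)((v >> i) & 1) << lane;
    }
}

/* Read the 32-bit value back out of one lane. */
static uint32_t bs_get(const bs_t x[BITS], int lane)
{
    uint32_t v = 0;
    for (int i = 0; i < BITS; i++)
        v |= (uint32_t)((x[i] >> lane) & 1) << i;
    return v;
}
```

Filling a few lanes with `bs_set()`, calling `bs_add4()` and reading the lanes back with `bs_get()` should match ordinary 32-bit addition of the four inputs.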

...On top of this, if we didn't already have a
bitselect exposed in the original expression (like we do here), some of
the logical operations from our implementations of the ADDs might be
merged with nearby logical operations from the original expression
producing more complicated yet single-instruction logical operations
such as bitselect or AVX-512/Maxwell ternary logic operations.

Yes, see my last email where I consider Neon and AVX512. 
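To illustrate the kind of merging meant here, a small sketch of how the full-adder logic itself maps onto bitselect and onto a three-input lookup-table operation. `bitselect()` and `ternlog()` below are portable software models written for this example, not real intrinsics. The truth-table constants 0x96 (three-way XOR) and 0xE8 (majority) follow the usual 8-bit lookup-table convention. The operation counts in the comments assume the hardware provides each primitive as a single instruction.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bs_t;   /* one bit plane: 64 independent lanes */

/* Model of bitselect: bits of x where m is 1, bits of y where m is 0. */
static bs_t bitselect(bs_t m, bs_t x, bs_t y)
{
    return (x & m) | (y & ~m);
}

/* Model of a 3-input lookup-table operation: bit (a<<2 | b<<1 | c) of
 * imm8 gives the output for that combination of input bits. */
static bs_t ternlog(bs_t a, bs_t b, bs_t c, unsigned imm8)
{
    bs_t r = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        if (!((imm8 >> idx) & 1))
            continue;
        bs_t ma = (idx & 4) ? a : ~a;
        bs_t mb = (idx & 2) ? b : ~b;
        bs_t mc = (idx & 1) ? c : ~c;
        r |= ma & mb & mc;   /* lanes where the inputs match this row */
    }
    return r;
}

/* Full adder with bitselect: 3 operations instead of 5.
 * If a != b the carry out equals c, otherwise it equals a. */
static void full_add_sel(bs_t a, bs_t b, bs_t c, bs_t *sum, bs_t *carry)
{
    bs_t t = a ^ b;
    *sum = t ^ c;
    *carry = bitselect(t, c, a);
}

/* Full adder with ternary logic: 2 operations. */
static void full_add_lut(bs_t a, bs_t b, bs_t c, bs_t *sum, bs_t *carry)
{
    *sum = ternlog(a, b, c, 0x96);     /* a ^ b ^ c */
    *carry = ternlog(a, b, c, 0xE8);   /* majority(a, b, c) */
}
```

Both variants should produce the same sum and carry planes as the plain XOR/AND/OR formulation for any inputs.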

...I briefly experimented with merged ADDs in this md5slice.c revision

I will take a look.

...add32c() is a 3-input ADD where one of the inputs is a constant

I checked this code while looking for a way to reduce the instruction count of the sums. If I understand it correctly, you use more than 5 instructions for one add (more than 10 for 2; if I recall correctly, you use 11).
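As an illustration of why a constant input matters in bitslice form, here is a sketch of adding a known constant to a bitsliced word. This is not the add32c() from md5slice.c, which is not reproduced here; `bs_add_const`, `bs_splat` and `bs_lane` are hypothetical names, and only the constant part of a 3-input ADD is shown. The assumption is that each bit of the constant is known at compile time, so the full adder for that bit collapses to two or three logical operations.

```c
#include <assert.h>
#include <stdint.h>

#define BITS 32
typedef uint64_t bs_t;   /* one bit plane: 64 independent lanes */

/* r = a + k (mod 2^32) in every lane, where k is an ordinary constant.
 * In real bitslice code the loop would be unrolled and the test on each
 * bit of k resolved at compile time. */
static void bs_add_const(bs_t r[BITS], const bs_t a[BITS], uint32_t k)
{
    bs_t carry = 0;
    for (int i = 0; i < BITS; i++) {
        bs_t x = a[i];
        if ((k >> i) & 1) {
            r[i] = ~(x ^ carry);   /* x ^ 1 ^ carry */
            carry = x | carry;     /* majority(x, 1, carry) */
        } else {
            r[i] = x ^ carry;      /* x ^ 0 ^ carry */
            carry = x & carry;     /* majority(x, 0, carry) */
        }
    }
}

/* Fill all 64 lanes with the same 32-bit value. */
static void bs_splat(bs_t x[BITS], uint32_t v)
{
    for (int i = 0; i < BITS; i++)
        x[i] = ((v >> i) & 1) ? ~(bs_t)0 : 0;
}

/* Read the 32-bit value held in one lane. */
static uint32_t bs_lane(const bs_t x[BITS], int lane)
{
    uint32_t v = 0;
    for (int i = 0; i < BITS; i++)
        v |= (uint32_t)((x[i] >> lane) & 1) << i;
    return v;
}
```

Whether this kind of specialization carries over to a merged 3-input or 4-input ADD is the open question in this thread; the sketch only covers the 2-input case.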

...Finally, much of the estimated 27% increase in instruction count comes
from extra loads.  But extra loads are not necessarily extra
instructions.  On x86, we have load-op instructions (where one of the
inputs is in memory), and the SIMD ones typically execute in a single
cycle (when the data is in L1 cache) - just as fast as instructions with
both inputs in registers.  This works as long as at most 2 out of 3
instructions on a given clock cycle are load-ops, since we only have 2
read ports from L1 data cache.  I think 2 out of 3 is good enough for
our needs here.

Agreed, but NEON, for example, does not have load-op instructions, and I was trying to make a more general comparison.

...Alain, I think you might be the person who could pull this off, given
the additional advice and encouragement above. :-)

I am not sure about this. I can test it on NEON, but not on AVX-512, which I think is more important. I will check your link, but as I said, I tried to reduce the instruction count of the sums (considering that they are chained) and have not found a way to do it that is obvious to me.

Regards, 
Alain