john-dev - Re: Formats using non-SIMD SHA2 implementations


Message-ID: <4c5fdd7839e319e959705067e1a53a1f@smtp.hushmail.com>
Date: Tue, 18 Aug 2015 10:39:22 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Formats using non-SIMD SHA2 implementations

On 2015-08-18 10:30, magnum wrote:
> On 2015-08-18 05:51, Lei Zhang wrote:
>>> On Aug 18, 2015, at 9:43 AM, magnum <john.magnum@...hmail.com> wrote:
>>> RAR3 can also be tens of MB in size (per lane). But in early
>>> rar-opencl kernel I had it as just two full buffers: "unsigned char
>>> c[2*64]" (which was also in a union with other ways to describe it).
>>> Then I always wrote to buffer[index & 127]. Whenever I saw that I
>>> went into "the other" buffer, I called the digest function for the
>>> just filled buffer.
>>>
>>> I'm not sure I describe it very well %-)  Maybe looking at "git show
>>> 2972a53899:src/opencl/rar_kernel.cl" will show what I mean. That code
>>> did not use vectors but the idea will apply to SIMD CPU too. Very
>>> effective in terms of memory use.
>>
>> I viewed your code. It seems you only need to handle a single lane in
>> the kernel function. The problem in the SIMD code is that I have to
>> handle all lanes simultaneously. With your double buffers approach, I
>> need to call the digest function when buffers for all lanes are full,
>> but they might not be full at the same round. The buffers for some
>> lanes might be filled faster than other lanes, thus it's complicated
>> to determine at which points to call the digest function.
>>
>> Or perhaps I didn't understand your point correctly ?
>
> Oh, you are right. The most effective way of handling this case might be
> to sort lengths like in Jim's sha256crypt, and *then* do it like above.
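
For anyone without that old commit at hand, here is a rough single-lane
sketch in plain C of the two-buffer scheme quoted above. It is not the
actual kernel code: ring_t, ring_put, sink_t and collect_block are
hypothetical names, and collect_block merely stands in for the real SHA
block function. Chunks of up to 64 bytes are written at (index & 127), and
whenever a write spills into "the other" half, the half that just filled
up is handed to the digest callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Callback that consumes one full 64-byte block (stand-in for SHA). */
typedef void (*block_fn)(const uint8_t *block, void *ctx);

typedef struct {
    uint8_t c[2 * 64];  /* two 64-byte halves used as a ring */
    size_t index;       /* total number of bytes written so far */
} ring_t;

static void ring_init(ring_t *r)
{
    r->index = 0;
}

/* Append up to 64 bytes; digest a half as soon as we have left it. */
static void ring_put(ring_t *r, const uint8_t *data, size_t len,
                     block_fn digest, void *ctx)
{
    size_t before = r->index / 64;  /* block we started writing in */

    assert(len <= 64);              /* so at most one boundary is crossed */
    for (size_t i = 0; i < len; i++)
        r->c[(r->index + i) & 127] = data[i];
    r->index += len;

    /* We moved into the other half: the half we came from is now full. */
    if (r->index / 64 != before)
        digest(&r->c[(before * 64) & 127], ctx);
}

/* Hypothetical sink that just records the blocks it is given. */
typedef struct {
    uint8_t data[2048];
    size_t len;
} sink_t;

static void collect_block(const uint8_t *block, void *ctx)
{
    sink_t *s = ctx;

    assert(s->len + 64 <= sizeof s->data);
    memcpy(s->data + s->len, block, 64);
    s->len += 64;
}
```

The second half is what lets a chunk straddle a block boundary: its tail
already sits in the other half while the full half is being digested.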

BTW both 7z and rar3 have a property that guarantees the entire message
length, over all rounds, is evenly divisible by 64. The current RAR3 kernel
takes that opportunity to shuffle data much faster (at the cost of some
global memory, but it's still a lot faster) like this:

1. Prepare a buffer not of 2*64 bytes but of the smallest size that ends
exactly on a block boundary (len % 64 == 0).
2. When copying a limb, do it 16 x 32 bits at a time (btw this will
benefit SIMD scattering even more than OpenCL!) from that aligned buffer.
3. After each step 2, just update the "rounds" bytes within the buffer
prepared in step 1.
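
A minimal plain-C sketch of those three steps, under some assumptions that
are mine and not taken from the kernel: each round contributes a
fixed-length chunk ending in a 3-byte little-endian round counter, the
buffer holds lcm(chunk length, 64) bytes, and the counters are patched once
per pass over the buffer rather than after every limb. All names are
hypothetical. Instead of hashing, it emits the 64-byte blocks so they can be
compared against the naive concatenation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 64u    /* hash input block ("limb") size in bytes */
#define CTR_LEN 3u   /* assumed: 3-byte little-endian round counter */

static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Store round number r as a 3-byte little-endian counter. */
static void put_ctr(uint8_t *p, uint32_t r)
{
    p[0] = (uint8_t)r;
    p[1] = (uint8_t)(r >> 8);
    p[2] = (uint8_t)(r >> 16);
}

/* Reference: plain concatenation of (fixed || counter) for every round. */
static uint8_t *naive_stream(const uint8_t *fixed, unsigned fixed_len,
                             uint32_t rounds)
{
    unsigned unit = fixed_len + CTR_LEN;
    uint8_t *s = malloc((size_t)unit * rounds);

    assert(s != NULL);
    for (uint32_t r = 0; r < rounds; r++) {
        memcpy(s + (size_t)r * unit, fixed, fixed_len);
        put_ctr(s + (size_t)r * unit + fixed_len, r);
    }
    return s;
}

/* Aligned-buffer variant; writes whole blocks to out, returns their count. */
static size_t aligned_stream(const uint8_t *fixed, unsigned fixed_len,
                             uint32_t rounds, uint8_t *out)
{
    unsigned unit = fixed_len + CTR_LEN;
    unsigned reps = BLOCK / gcd_u(unit, BLOCK); /* chunks per buffer */
    size_t buf_len = (size_t)unit * reps;       /* lcm(unit, 64) bytes */
    uint8_t *buf = malloc(buf_len);
    size_t nblocks = 0;

    assert(buf != NULL);
    assert(rounds % reps == 0); /* holds when rounds is a multiple of 64 */

    /* Step 1: fill the buffer with reps chunks, counters 0..reps-1. */
    for (unsigned j = 0; j < reps; j++) {
        memcpy(buf + (size_t)j * unit, fixed, fixed_len);
        put_ctr(buf + (size_t)j * unit + fixed_len, j);
    }

    for (uint32_t base = 0; base < rounds; base += reps) {
        /* Step 2: hand out aligned 64-byte blocks (16 x 32-bit words). */
        for (size_t off = 0; off < buf_len; off += BLOCK) {
            memcpy(out + nblocks * BLOCK, buf + off, BLOCK);
            nblocks++;
        }
        /* Step 3: patch only the counter bytes for the next pass. */
        for (unsigned j = 0; j < reps; j++)
            put_ctr(buf + (size_t)j * unit + fixed_len, base + reps + j);
    }
    free(buf);
    return nblocks;
}
```

The point is that the fixed bytes are laid out only once; after that, only
the counter bytes change, and every block read starts on a 64-byte boundary.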

The above was the final idea that made our RAR3-opencl as fast as cRARk, 
which was my benchmark goal at the time.

magnum
