Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 18 Aug 2015 11:51:26 +0800
From: Lei Zhang <>
Subject: Re: Formats using non-SIMD SHA2 implementations

> On Aug 18, 2015, at 9:43 AM, magnum <> wrote:
> On 2015-08-18 02:08, Lei Zhang wrote:
>> The problem in 7z is that the message to be hashed needs to be constructed first.
>> The original scalar code (simplified):
>> for (round = 0; round < rounds; ++round) {
>>     SHA256_Update(&ctx, password, len);
>>     SHA256_Update(&ctx, &round, sizeof(round));
>> }
>> The 'rounds' is a really big number, and a small difference in 'len' might results in very different length of the whole message. As a result, I cannot pre-tell how many limbs are needed.
>> One way is to construct the whole message before feeding it on-fly to a small vector buffer; or use a large vector buffer to construct the message in-place. Either way I cannot avoid using a large buffer. The optimal way might be to construct the message on-fly while feeding it to a small vector buffer. This is theoretically doable, but I found it too tricky to implement... So I chose to use a large vector buffer (~30MB for a non-OpenMP build), which works ok so far.
> RAR3 can also be tens of MB in size (per lane). But in early rar-opencl kernel I had it as just two full buffers: "unsigned char c[2*64]" (which was also in a union with other ways to describe it). Then I always wrote to buffer[index & 127]. Whenever I saw that I went into "the other" buffer, I called the digest function for the just filled buffer.
> I'm not sure I describe it very well %-)  Maybe looking at "git show 2972a53899:src/opencl/" will show what I mean. That code did not use vectors but the idea will apply to SIMD CPU too. Very effective in terms of memory use.

I viewed your code. It seems you only need to handle a single lane in the kernel function. The problem in the SIMD code is that I have to handle all lanes simultaneously. With your double buffers approach, I need to call the digest function when buffers for all lanes are full, but they might not be full at the same round. The buffers for some lanes might be filled faster than other lanes, thus it's complicated to determine at which points to call the digest function.

Or perhaps I didn't understand your point correctly ?


Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ