Date: Fri, 4 Sep 2015 10:43:54 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5crypt-opencl

On Thu, Sep 03, 2015 at 11:36:28PM +0200, Lukas Odzioba wrote:
> 2015-09-02 19:32 GMT+02:00 Lukas Odzioba <lukas.odzioba@...il.com>:
> > 2015-09-02 19:04 GMT+02:00 Solar Designer <solar@...nwall.com>:
> >> Oh, is it possibly because the array is two-dimensional?  Like some
> >> heuristic: "put all arrays with greater than one dimension in global
> >> memory".  It is probably worth trying to turn the array into
> >> single-dimensional and see.
> >
> > Who knows, I'll be happy to give it a try.
> 
> Performance is the same with 1 dimensional array, so I suppose that's
> not the way to go,

Yeah, I no longer expected the single-dimensional array to help when I
noticed that even the tiny altpos[] array gets placed in global memory.

Another guess was that byte-sized accesses were causing the array to be
placed in global memory, due to possible unavailability of such access
modes for VGPRs (I don't recall whether this is the case or not).
However, I've also since ruled this out (at least as the only cause), by
changing the kernel such that there was no longer a single byte-sized
access left in the generated ISA code.
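
For illustration, one common way to get rid of byte-sized accesses is to
keep the buffer as 32-bit words and do byte updates with shifts and masks.
This is only a minimal sketch in plain C, not necessarily the change made
to the kernel here; put_byte() and get_byte() are hypothetical names, and
little-endian byte order within each word is assumed.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the actual kernel): store byte `val` at
 * byte offset `pos` of a buffer kept as 32-bit words, using only
 * word-sized loads and stores.  Assumes little-endian byte order within
 * each word.
 */
static void put_byte(uint32_t *buf, unsigned int pos, uint8_t val)
{
    unsigned int shift = (pos & 3) * 8;

    /* Clear the target byte lane, then OR in the new value. */
    buf[pos >> 2] = (buf[pos >> 2] & ~((uint32_t)0xff << shift)) |
                    ((uint32_t)val << shift);
}

/* Hypothetical counterpart: read one byte back via a word-sized load. */
static uint8_t get_byte(const uint32_t *buf, unsigned int pos)
{
    return (uint8_t)(buf[pos >> 2] >> ((pos & 3) * 8));
}
```

Whether the compiler then emits only dword accesses still has to be
checked in the generated ISA code.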

> but the code is not slightly simpler.

You mean, it _is_ slightly simpler?

> From what I recall there was no way to fit all ctx's with decent LWS value.
> Since some ctx's are more often used than the others my idea was to
> move those hot to the local memory and keep the rest in global.
> Another loose idea was to try to "preload" next ctx to the local
> memory and do writeback after that, but I have no idea whether it
> makes sense at all with not so long computations as we have in md5.

It might make sense.

Another idea is an even more compact representation, with two-level
indirection: md5_digest() would read its input words via indices stored
in another (tiny) array in private memory (or maybe in 14 individual
variables, or fewer if several indices are packed into different bit
positions of one variable).  Eight such sets of 14 indices may be made
smaller than the current 8 sets of 14 32-bit words, and this would allow
most of the 32-bit data words to be shared between the 8 sets.
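
A minimal sketch of the index packing, in plain C rather than OpenCL C.
Everything here is an assumption for illustration: the shared pool is
taken to hold at most 16 distinct words so that each index fits in 4
bits, and POOL_WORDS, pack_indices() and set_word() are hypothetical
names, not code from the kernel.

```c
#include <stdint.h>

#define POOL_WORDS 16   /* assumed size of the shared pool of data words */
#define SET_WORDS  14   /* input words per set, as described above */
#define NUM_SETS   8    /* number of sets, as described above */

/*
 * Pack the 14 pool indices of one set into a single 64-bit value,
 * 4 bits per index (56 bits used).  Indices are assumed to be < 16.
 */
static uint64_t pack_indices(const uint8_t idx[SET_WORDS])
{
    uint64_t packed = 0;

    for (int i = 0; i < SET_WORDS; i++)
        packed |= (uint64_t)(idx[i] & 0xf) << (4 * i);
    return packed;
}

/*
 * Second level of the indirection: fetch input word i of a set by
 * extracting its 4-bit index and reading the shared pool.
 */
static uint32_t set_word(const uint32_t pool[POOL_WORDS],
                         uint64_t packed, int i)
{
    return pool[(packed >> (4 * i)) & 0xf];
}
```

Under these assumptions, 8 packed index sets take 8 * 8 = 64 bytes and
the pool takes 16 * 4 = 64 bytes, versus 8 * 14 * 4 = 448 bytes for 8
full sets of 14 words.  Whether the real data can be shared to that
extent is not established here.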

> Here is my patch, but I suppose it will be easier to modify current
> code and we should keep it.

Thanks.  I already have a revision of the code as well.  Curiously, I
also have those "uint *ctx_buffer" things under that same name.

Alexander
