Date: Wed, 31 Jul 2013 23:25:21 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Alexander, Yaniv,

Alexander, I'm currently working on everything you've mentioned so far. I'm
using cpp macros, the P arrays are preloaded using ldrd, the str
instructions are replaced by strd with post-increment, and rts is added.
I'm using two structs, but both are still inside one struct called
shared_buffer. And I have an interesting situation: the code isn't
reliable (it sometimes fails the self-test), but when it works I get a
very weird speed: 497619 c/s (it's not constant, but it's always 49xxxx,
both real and virtual). I am testing the bcrypt-parallella format; I only
changed how data is transferred and how the result is read (separate
structs for input and output; I still haven't implemented the savings for
when the salt or keys haven't changed). I don't understand this speed. If
I measure the time including transfers, it's around 0.05 ms. But with
unoptimized bcrypt, computing the hash without transfers took around
16.5 ms. If I read the whole outputs struct and then use memcpy to put the
result in parallella_BF_out, the speed is 1204 c/s. The code which gives
this very high speed is committed.
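
As a sanity check on the figures above, here is a minimal sketch that
converts between time per crypt_all() call and c/s. It assumes 32 keys per
call (inferred from the 8*32 term and the "32 instances of set_key()"
remark in the quoted message below) and that the quoted times are per
call; both are assumptions, and the helper names are made up for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 32 candidate keys are hashed per crypt_all() call. */
#define KEYS_PER_CALL 32

/* Hashes per second if one crypt_all() call takes ms_per_call milliseconds. */
static inline double cps_from_ms(double ms_per_call)
{
    assert(ms_per_call > 0.0);
    return KEYS_PER_CALL * 1000.0 / ms_per_call;
}

/* Milliseconds per crypt_all() call implied by a reported c/s figure. */
static inline double ms_from_cps(double cps)
{
    assert(cps > 0.0);
    return KEYS_PER_CALL * 1000.0 / cps;
}

/* Print both directions for one measured time (hypothetical helper). */
static inline void print_speed_check(double ms_per_call)
{
    printf("%.3f ms per call -> %.0f c/s\n",
           ms_per_call, cps_from_ms(ms_per_call));
}
```

Under those assumptions, 16.5 ms per call corresponds to about 1939 c/s
and 0.05 ms to 640000 c/s, while the reported 497619 c/s corresponds to
about 0.064 ms per call. The arithmetic only shows that the fast figure is
consistent with the 0.05 ms measurement and not with the 16.5 ms one; it
does not by itself explain why.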

Yaniv, how should I use two separate structs? In the matmul example, only
one is used. I tried allocating memory with the size of both structs,
hoping that if I put both in "shared_dram" they would be placed one after
the other. But that wasn't the case. I used e-read -1 1000000 10000 and
couldn't find the other struct (I was searching for correct results and
done flags, which are very easy to spot, or for 320 bytes of zeroes in case
something went wrong and nothing was written).
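
One way to make the position of the second struct predictable is to keep
both inside a single wrapper and compute the offset with offsetof(),
which is roughly the shared_buffer arrangement mentioned above. The sketch
below is host-side only and uses a plain byte array in place of the shared
memory region; the typedef widths (8-byte BF_binary, 72-byte BF_key) are
assumptions taken from the size arithmetic in the quoted message, and
read_outputs() is a hypothetical helper, not part of any SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EPIPHANY_CORES 16
#define MAX_KEYS_PER_CRYPT 32

typedef unsigned int BF_word;   /* assumed to be 32 bits wide */
typedef BF_word BF_binary[2];   /* assumed 8 bytes, matching the 8*32 term */
typedef BF_word BF_key[18];     /* assumed 72 bytes */

typedef struct {
    BF_word salt[4];
    unsigned char rounds;
    unsigned char flags;
    unsigned char subtype;
    int start1[EPIPHANY_CORES];
    BF_key init_key[MAX_KEYS_PER_CRYPT];
    int start2[EPIPHANY_CORES];
} inputs;

typedef struct {
    BF_binary result[MAX_KEYS_PER_CRYPT];
    int core_done[EPIPHANY_CORES];
} outputs;

/* Wrapper that fixes the relative placement of the two structs. */
typedef struct {
    inputs in;
    outputs out;
} shared_buffer;

/* Copy the outputs struct out of a raw image of the shared region.
   shared_base must point at the start of a shared_buffer image. */
static inline void read_outputs(const unsigned char *shared_base, outputs *dst)
{
    assert(shared_base != NULL && dst != NULL);
    memcpy(dst, shared_base + offsetof(shared_buffer, out), sizeof *dst);
}
```

With 4-byte int and BF_word, offsetof(shared_buffer, out) is 2452, so the
results and done flags would start 2452 bytes past the base of the buffer.
Where the base itself ends up in external memory depends on the linker
script and allocation call, which this sketch does not cover.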

Thanks,

Katja


On Wed, Jul 31, 2013 at 5:07 AM, Solar Designer <solar@...nwall.com> wrote:

> Katja,
>
> On Wed, Jul 31, 2013 at 06:11:51AM +0400, Solar Designer wrote:
> > In fact, since at least one of "salt" or "keys" should change between
> > crypt_all() calls almost all of the time, and since we prefer to do just
> > one transfer, we will always be transferring the "salt".  Thus, the
> > salt_changed flag's only use is to save on reading the salt from
> > external memory on the Epiphany side if the salt has not changed.  Other
> > than that, you can always transfer the salt from the host.
>
> Note that with current JtR interfaces there's only one "salt" for all
> cores.  Right now, your code unnecessarily uses per-core (and even
> per-instance!) salts.  You can save by transferring and loading only one
> salt from external memory (and having the other cores read that one salt
> from one core's local memory).
>
> Also, set_key() in parallella_bf_fmt.c is simple enough that you can
> have it on Epiphany, thereby reducing the data transfer size and having
> the 32 instances of set_key() execute in parallel.  (On host, you will
> then use simpler set_key() that will merely buffer the data, without
> expansion to the 72-char strings).
>
> Your combining both inputs and outputs in one struct is weird/wrong.
> I think you should have separate structs for transfer to Epiphany (one
> struct) and for transfer of the results back (another struct).  Right
> now, you have:
>
> typedef struct
> {
>         BF_binary result[MAX_KEYS_PER_CRYPT];
>         int core_done[EPIPHANY_CORES];
>         BF_key init_key[MAX_KEYS_PER_CRYPT];
>         BF_key exp_key[MAX_KEYS_PER_CRYPT];
>         BF_salt setting[EPIPHANY_CORES];
>         int start[EPIPHANY_CORES];
> }data;
>
> Its size is something like:
>
> 8*32+4*16+72*2*32+(16+4+4)*16+4*16 = 5376 bytes
>
> You can instead have:
>
> typedef struct {
>         BF_word salt[4];
>         unsigned char rounds;
>         unsigned char flags; /* bit 0 keys_changed, bit 1 salt_changed */
>         unsigned char subtype; /* only if you move set_key() to Epiphany */
>         int start1[EPIPHANY_CORES];
>         BF_key init_key[MAX_KEYS_PER_CRYPT];
>         int start2[EPIPHANY_CORES];
> } inputs;
>
> 2452 bytes, and only 80 bytes are transferred most of the time (when
> keys are not changed, and thus you can exclude init_key).  I added
> start2[] so that you can have the start flags at the end of the struct
> (confirming its full transfer) when you do transfer the keys.
>
> typedef struct {
>         BF_binary result[MAX_KEYS_PER_CRYPT];
>         int core_done[EPIPHANY_CORES];
> } outputs;
>
> 320 bytes.
>
> BTW, in parallella_e_bcrypt.c you currently have the unused array
> flags_by_subtype[] - just drop it.
>
> Alexander
>
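
The sizes in the proposed layout quoted above can be checked mechanically.
This is a minimal sketch, assuming 4-byte int and BF_word, an 8-byte
BF_binary and a 72-byte BF_key (widths inferred from the quoted
arithmetic, not taken from the actual headers). The flag constants and
input_transfer_size() are hypothetical names for the "bit 0 keys_changed,
bit 1 salt_changed" idea.

```c
#include <assert.h>
#include <stddef.h>

#define EPIPHANY_CORES 16
#define MAX_KEYS_PER_CRYPT 32

typedef unsigned int BF_word;   /* assumed to be 32 bits wide */
typedef BF_word BF_binary[2];   /* assumed 8 bytes */
typedef BF_word BF_key[18];     /* assumed 72 bytes */

typedef struct {
    BF_word salt[4];
    unsigned char rounds;
    unsigned char flags;        /* bit 0 keys_changed, bit 1 salt_changed */
    unsigned char subtype;
    int start1[EPIPHANY_CORES];
    BF_key init_key[MAX_KEYS_PER_CRYPT];
    int start2[EPIPHANY_CORES];
} inputs;

typedef struct {
    BF_binary result[MAX_KEYS_PER_CRYPT];
    int core_done[EPIPHANY_CORES];
} outputs;

enum {
    FLAG_KEYS_CHANGED = 1 << 0,
    FLAG_SALT_CHANGED = 1 << 1
};

/* Number of leading bytes of `inputs` the host needs to send:
   everything up to and including start1[] when the keys are unchanged,
   otherwise the whole struct so that start2[] arrives last. */
static inline size_t input_transfer_size(unsigned char flags)
{
    if (flags & FLAG_KEYS_CHANGED)
        return sizeof(inputs);
    return offsetof(inputs, init_key);
}
```

Under these assumptions sizeof(inputs) is 2452 and sizeof(outputs) is 320,
matching the quoted figures. The short transfer comes out at 84 bytes
rather than 80, because the three unsigned char fields are padded to 4
bytes before start1[]; the exact value is worth confirming with the
compilers used on both the host and Epiphany sides.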



-- 
Katja
