john-dev - Re: Parallella: Litecoin mining


Message-ID: <20130904050424.GB23413@openwall.com>
Date: Wed, 4 Sep 2013 09:04:24 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: Litecoin mining

Rafael,

On Wed, Sep 04, 2013 at 03:27:05AM +0100, Rafael Waldo Delgado Doblas wrote:
> Well, it was running without any crashes for almost 4 hours, and if I force
> it to find a share by ignoring the scrypt hash and setting it to 0, all looks
> fine (of course the share is rejected).  However, I cannot find any share.
> Can it be because the work is discarded before Epiphany can find a valid share?

I don't know what you mean by "the work is discarded" here.  What
component do you think discards the work, and why?

You should get your modified cgminer to run to a point where the pool
does accept a share.

> In addition, as you asked, this is the work that I did today:

Thanks!

> I implemented a couple of salsa20_8 asm versions:
> The first one has the loops "Bor[i] = Bout[i] = (B[i] ^ Bx[i]);" and "Bout[i]
> += Bor[i];" rolled and uses the imadd instruction.  It saves about 250 bytes,
> but the performance drops by almost 0.5 khash/s.
> The second keeps the loops unrolled and uses the imadd instruction.  It saves
> only 50 bytes, but the performance also drops by almost 0.5 khash/s.

This is fine.

> At this point it looks like the imadd instruction is too slow to be used, but
> rolling the loops could be nice.

Wrong conclusions.  Please re-read:

http://www.openwall.com/lists/john-dev/2013/08/29/4

"Slow" is non-informative.  IMADD has high latency (you'd call this
"slow"), but it also has high throughput (we may call this "fast").
This means that with proper instruction scheduling it can be fast.  When
you use inline asm for just one rotate operation, you make the IMADD and
its latency opaque to gcc.  As a result, gcc is unable to produce
good instruction scheduling.  Additionally, your use of just one
register for the temporary value does not allow multiple rotate
operations to overlap (with their instructions interleaved).  With this
inline asm approach gcc would not perform that optimization anyway
because, once again, the piece of inline asm is opaque to it: as far as
gcc is aware, it's just a string, not a piece of code that it could
possibly intermix with another piece of code.
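
To illustrate the point, here is a rough sketch in plain C rather than
inline asm.  It is not your actual code and is untested on Epiphany.  The
function names (rotl_ref, rotl_madd, rotl2_madd, check_rotl_madd) are made
up for this example, and whether a given gcc build actually turns the
multiply-add into IMADD is an assumption that has to be checked in the
generated asm.  The idea is only that the rotate is written as a shift
plus a multiply-add that the compiler can see, and that two independent
rotates use separate temporaries so that their instructions may overlap.

```c
#include <assert.h>
#include <stdint.h>

/* Reference rotate-left, valid for 1 <= n <= 31. */
static inline uint32_t rotl_ref(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/*
 * The same rotate as a shift plus a multiply-add: x * 2^n equals x << n
 * (mod 2^32), and its set bits never overlap those of x >> (32 - n), so
 * "+" gives the same result as "|".  Valid for 1 <= n <= 31.
 */
static inline uint32_t rotl_madd(uint32_t x, unsigned n)
{
	uint32_t t = x >> (32 - n);	/* low bits of the result */
	t += x * ((uint32_t)1 << n);	/* multiply-add supplies the rest */
	return t;
}

/*
 * Two independent rotates, each with its own temporary.  Nothing in the
 * second rotate depends on the first, so a scheduler is free to interleave
 * them.  With a single shared temporary it could not.
 */
static inline void rotl2_madd(uint32_t *a, uint32_t *b,
    unsigned na, unsigned nb)
{
	uint32_t ta = *a >> (32 - na);
	uint32_t tb = *b >> (32 - nb);
	ta += *a * ((uint32_t)1 << na);
	tb += *b * ((uint32_t)1 << nb);
	*a = ta;
	*b = tb;
}

/* Sanity check: the multiply-add form matches the reference rotate. */
static void check_rotl_madd(void)
{
	uint32_t a = 0x80000001u, b = 0x12345678u;
	unsigned n;

	for (n = 1; n < 32; n++)
		assert(rotl_madd(b, n) == rotl_ref(b, n));

	rotl2_madd(&a, &b, 1, 7);
	assert(a == 0x00000003u);
	assert(b == rotl_ref(0x12345678u, 7));
}
```

In salsa20_8 the rotate counts are constants, so after inlining the
1 << n factors become constants as well.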

Alexander