Date: Sun, 26 May 2013 19:37:55 -0400
From: Yaniv Sapir <>
Subject: Re: Parallella: bcrypt


Here are some command-line options you can try when compiling the code.
Please see the manual for further details.

-mfp-mode=int        # sets the FPU mode to integer; however, please make
                     # sure that the generated code does not re-program the
                     # CONFIG register before every integer operation
-Ofast               # e-gcc supports this level too
-funroll-loops       # unroll the loops for better performance
-falign-loops=8      # align the body of the loop to an 8-byte boundary
-falign-functions=8  # same, but for functions entry point
-ffast-math          # really an FP option, but you may gain something here

Under no circumstances should you run the compute-intensive code (or keep
its data) in external memory. The performance penalty of doing so can't be
offset by any optimization options.

If you find out that the code and data set won't fit in, try limiting the
amount of loop unrolling and see the effect. Use:


The best strategy is to compile your functions in separate modules, then
apply to each module the optimization switches that best match its needs.
When optimizing globally, you usually pay with unnecessarily bloated code.

On Sun, May 26, 2013 at 4:48 PM, Solar Designer <> wrote:

> On Sun, May 26, 2013 at 09:38:00PM +0200, Katja Malvoni wrote:
> > ORIGINAL with *-O2*
> >       Message from eCore 0x88a ( 2, 2): Result:
> > "$2a$05$XXXXXXXXXXXXXXXXXXXXXOAcXxm9kjPGEMsLznoKqmqw7tc8WCx4a"#
> >       Execution time - Epiphany: 42.688000 ms
>
> OK, this is slightly better, but still 4x slower than our target speed.
> A 2x difference is explained by non-use of the integer instructions on
> the FPU (of those, perhaps we'd be able to use just IADD, though).
>
> A further 2x difference is mostly not explained... we'd need to take a
> look at the code, or maybe we need to try inter-mixing two instances of
> bcrypt first, as it helps to hide instruction latencies.
>
> A minor optimization you'd need to make is compute only the first 64
> bits of output (one Blowfish block size) on Epiphany.  When these happen
> to match the first 64 bits of a bcrypt hash loaded for cracking (which is
> likely to happen only if we have attempted the correct password), JtR
> will compute the full hash on host CPU in cmp_exact() (just to make sure
> we do indeed have the correct password and the hash loaded for cracking
> had not been corrupted).  This will provide very little speedup, though.
>
> Anyhow, your immediate next task is making use of all cores.  Even at
> only 23 c/s per core as above, 16 cores will do ~370 c/s, which is the
> speed of a Pentium 4 (but at a much lower power consumption), although
> new multi-core x86 CPUs are roughly 10x faster.  Indeed, 370 c/s at
> $2a$05 is not a serious speed for actual use currently, but we're merely
> experimenting with this new technology.  Hopefully, we'll be able to
> improve the speed to up to 1600 c/s on 16 Epiphany cores, and hopefully
> we'll have 64-core and larger Epiphany chips later.
>
> Thanks,
> Alexander
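
To illustrate the "inter-mixing two instances" idea from the quoted text, here is a minimal sketch in C, not the actual JtR or Epiphany code. It assumes a Blowfish-style 16-round Feistel loop; bf_state, init_toy_state(), encrypt_one(), encrypt_two() and check_interleaving() are hypothetical names, and the tables are filled with placeholder values rather than real Blowfish constants or a real bcrypt key schedule. The point is only that the two dependency chains are independent, so a compiler or in-order core has a chance to overlap the S-box lookups of one instance with those of the other.

```c
#include <assert.h>
#include <stdint.h>

/* Per-instance state. In bcrypt each instance has its own key-dependent
 * P-array and S-boxes; here they only hold placeholder values. */
typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_state;

/* Hypothetical helper: fill a state with arbitrary values from a simple
 * LCG. These are NOT the real Blowfish constants or key schedule. */
static void init_toy_state(bf_state *s, uint32_t seed)
{
    uint32_t x = seed;
    for (int i = 0; i < 18; i++) {
        x = x * 1664525u + 1013904223u;
        s->P[i] = x;
    }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 256; j++) {
            x = x * 1664525u + 1013904223u;
            s->S[i][j] = x;
        }
}

/* Blowfish F function: ((S0[a] + S1[b]) ^ S2[c]) + S3[d]. */
static inline uint32_t bf_f(const bf_state *s, uint32_t x)
{
    return ((s->S[0][x >> 24] + s->S[1][(x >> 16) & 0xff])
            ^ s->S[2][(x >> 8) & 0xff]) + s->S[3][x & 0xff];
}

/* Encrypt one 64-bit block for a single instance. */
static void encrypt_one(const bf_state *s, uint32_t *l, uint32_t *r)
{
    uint32_t L = *l, R = *r, t;
    for (int i = 0; i < 16; i++) {
        L ^= s->P[i];
        R ^= bf_f(s, L);
        t = L; L = R; R = t;
    }
    t = L; L = R; R = t;        /* undo the final swap */
    R ^= s->P[16];
    L ^= s->P[17];
    *l = L; *r = R;
}

/* Same computation for two independent instances, interleaved round by
 * round so that neither chain has to wait on the other's table lookups. */
static void encrypt_two(const bf_state *a, const bf_state *b,
                        uint32_t *la, uint32_t *ra,
                        uint32_t *lb, uint32_t *rb)
{
    uint32_t La = *la, Ra = *ra, Lb = *lb, Rb = *rb, t;
    for (int i = 0; i < 16; i++) {
        La ^= a->P[i];          Lb ^= b->P[i];
        Ra ^= bf_f(a, La);      Rb ^= bf_f(b, Lb);
        t = La; La = Ra; Ra = t;
        t = Lb; Lb = Rb; Rb = t;
    }
    t = La; La = Ra; Ra = t;
    t = Lb; Lb = Rb; Rb = t;
    Ra ^= a->P[16];             Rb ^= b->P[16];
    La ^= a->P[17];             Lb ^= b->P[17];
    *la = La; *ra = Ra; *lb = Lb; *rb = Rb;
}

/* Self-check: the interleaved version must agree with two plain calls. */
static void check_interleaving(void)
{
    static bf_state a, b;
    uint32_t l1 = 0x01234567u, r1 = 0x89abcdefu;
    uint32_t l2 = 0xdeadbeefu, r2 = 0x0badf00du;
    uint32_t la = l1, ra = r1, lb = l2, rb = r2;

    init_toy_state(&a, 1u);
    init_toy_state(&b, 2u);
    encrypt_one(&a, &l1, &r1);
    encrypt_one(&b, &l2, &r2);
    encrypt_two(&a, &b, &la, &ra, &lb, &rb);
    assert(la == l1 && ra == r1 && lb == l2 && rb == r2);
}
```

Whether this actually recovers part of the unexplained 2x on Epiphany would have to be measured; note also that two instances need two sets of S-boxes, which matters for fitting into local memory.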

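The "first 64 bits only" optimization described in the quoted text amounts to a two-stage comparison, which can be sketched as follows. This is an illustration under assumptions, not JtR's actual interface: full_hash, first_block_matches(), check_candidate() and toy_full_hash() are hypothetical names, and toy_full_hash() is a stand-in for the real bcrypt computation that the host would perform in cmp_exact(). A full bcrypt result is taken here as three 64-bit Blowfish blocks, i.e. six 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BF_HASH_WORDS 6         /* 3 Blowfish blocks of 64 bits each */

typedef struct {
    uint32_t w[BF_HASH_WORDS];
} full_hash;

/* Hypothetical callback type: host-side full hash computation. */
typedef void (*full_hash_fn)(const char *candidate, full_hash *out);

/* Stand-in for the real bcrypt computation (NOT bcrypt): an FNV-1a style
 * mix, used only so that this sketch is self-contained. */
static void toy_full_hash(const char *candidate, full_hash *out)
{
    uint32_t x = 2166136261u;
    for (int i = 0; i < BF_HASH_WORDS; i++) {
        for (const char *p = candidate; *p; p++)
            x = (x ^ (unsigned char)*p) * 16777619u;
        x = (x ^ (uint32_t)i) * 16777619u;
        out->w[i] = x;
    }
}

/* Stage 1: cheap test on the one block (64 bits) the device computed. */
static int first_block_matches(const uint32_t device_out[2],
                               const full_hash *loaded)
{
    return device_out[0] == loaded->w[0] && device_out[1] == loaded->w[1];
}

/* Stage 2 runs only on a stage-1 hit: recompute the full hash on the host
 * and compare all words, guarding against a corrupted loaded hash. */
static int check_candidate(const char *candidate,
                           const uint32_t device_out[2],
                           const full_hash *loaded,
                           full_hash_fn host_full_hash)
{
    full_hash h;

    assert(host_full_hash != NULL);
    if (!first_block_matches(device_out, loaded))
        return 0;
    host_full_hash(candidate, &h);
    return memcmp(h.w, loaded->w, sizeof h.w) == 0;
}
```

Since stage 2 is reached almost exclusively for the correct password, its cost on the host is negligible; the saving on the device is the work of producing the remaining two blocks, which as noted above is small.
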
Yaniv Sapir
Adapteva Inc.
1666 Massachusetts Ave, Suite 14
Lexington, MA 02420
Phone: (781)-328-0513 (x104)
