john-dev - Re: bf

Message-ID: <20120710060236.GA6348@openwall.com>
Date: Tue, 10 Jul 2012 10:02:36 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Tue, Jul 10, 2012 at 11:16:17AM +0530, Sayantan Datta wrote:
> I remember that during actual cracking the speed was limited to somewhere
> near 1000 c/s on the kernel using global memory, although benchmarking
> suggested a much higher 2400 c/s. This suggests that we were incurring stalls
> during actual cracking which we weren't during benchmarking.  I think this
> is the most we can achieve using global memory.

Oh, I think I was not aware of the lower speed during actual cracking.

The speed difference could be due to benchmarks having only a handful of
candidate passwords to test, and testing them repeatedly.  So we get
multiple instances of the same candidate password in-flight at a given
time, presumably resulting in accesses to similar memory locations.

Yet this is puzzling since the S-boxes are all separate and read-write.
Even if similar (mod 4 KB) addresses are accessed and the same data is
being read/written, this shouldn't allow for better cache usage than if
the data were different.  In fact, the similarity in addresses could
result in more bank conflicts.

So we might want to investigate this.
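
To make the "similar (mod 4 KB) addresses" point concrete, here is a minimal
sketch in C.  Blowfish's four 256-entry 32-bit S-boxes take exactly 4 KB per
bcrypt instance.  The layout below (instance i stored at byte offset i * 4096
in one global buffer) is an assumption for illustration, and the function
names are hypothetical; the actual layout in bf_kernel.cl may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Blowfish: four S-boxes of 256 32-bit words each = 4 * 256 * 4 = 4096 bytes. */
#define SBOX_WORDS_PER_INSTANCE (4 * 256)
#define SBOX_BYTES_PER_INSTANCE (SBOX_WORDS_PER_INSTANCE * sizeof(uint32_t))

/*
 * ASSUMPTION: instance i keeps its S-boxes contiguously at byte offset
 * i * 4096 in a single global buffer.  The real kernel may lay them out
 * differently (e.g. interleaved across work-items).
 */
static size_t sbox_byte_offset(size_t instance, unsigned box, unsigned index)
{
    size_t word = instance * SBOX_WORDS_PER_INSTANCE + (size_t)box * 256 + index;
    return word * sizeof(uint32_t);
}

/* Returns 1 if instances 0..n-1 all hit the same offset mod 4 KB for (box, index). */
static int same_offset_mod_4k(size_t n, unsigned box, unsigned index)
{
    size_t first = sbox_byte_offset(0, box, index) % 4096;
    for (size_t i = 1; i < n; i++)
        if (sbox_byte_offset(i, box, index) % 4096 != first)
            return 0;
    return 1;
}
```

Under this assumed layout, instances that look up the same S-box index always
differ by a multiple of 4 KB in address.  Whether that turns into bank
conflicts depends on how the hardware maps addresses to banks, which this
sketch does not model.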

> Also, I could achieve nearly the same numbers using global memory alone
> despite heavily under-utilizing the CU. I limited the global number of work
> items to 512 and the work group size to 8, which produced 1019 c/s in actual
> cracking.
> 
> This puts my revised value of x at 4, not 8. So we will see up to 25%
> extra using global memory.

Makes sense.

> One more thing I would like you to know: your Sptr implementation
> performs nearly the same as before on NVIDIA after a 4x unroll of the
> 512-iteration loop.

This also makes sense.  Are you committing this change?  I think it
makes the code simpler, although it needs 3 extra registers per bcrypt
instance.  We should have plenty of spare registers since we're
under-utilizing the GPU anyway (assuming that this OpenCL code is being
run on a GPU).
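
For reference, a rough sketch in C of what a 4x unroll of a 512-iteration
pointer-walking loop looks like.  It assumes the loop in question is the one
that refills the 1024 S-box words two at a time, which the thread does not
state explicitly.  toy_encrypt() is a hypothetical stand-in for a Blowfish
block encryption, not the real cipher, and the other names are likewise made
up for illustration.

```c
#include <stdint.h>

#define S_WORDS 1024  /* four 256-entry S-boxes */

/* Hypothetical stand-in for one Blowfish block encryption; NOT the real cipher. */
static void toy_encrypt(uint32_t *L, uint32_t *R)
{
    uint32_t t = *L ^ (*R * 0x9E3779B9u + 0x7F4A7C15u);
    *L = *R;
    *R = (t << 5) | (t >> 27);
}

/* Rolled form: 512 iterations, each storing two words through a pointer. */
static void fill_rolled(uint32_t *S, uint32_t L, uint32_t R)
{
    uint32_t *ptr = S;
    for (int i = 0; i < 512; i++) {
        toy_encrypt(&L, &R);
        *ptr++ = L;
        *ptr++ = R;
    }
}

/* Unrolled 4x: 128 trips, four encryptions and eight stores per trip. */
static void fill_unrolled4(uint32_t *S, uint32_t L, uint32_t R)
{
    uint32_t *ptr = S;
    for (int i = 0; i < 512; i += 4) {
        toy_encrypt(&L, &R); ptr[0] = L; ptr[1] = R;
        toy_encrypt(&L, &R); ptr[2] = L; ptr[3] = R;
        toy_encrypt(&L, &R); ptr[4] = L; ptr[5] = R;
        toy_encrypt(&L, &R); ptr[6] = L; ptr[7] = R;
        ptr += 8;
    }
}

/* Returns 1 if both forms produce identical S contents for the given seed. */
static int fills_match(uint32_t L, uint32_t R)
{
    uint32_t a[S_WORDS], b[S_WORDS];
    fill_rolled(a, L, R);
    fill_unrolled4(b, L, R);
    for (int i = 0; i < S_WORDS; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The unrolled form performs the same stores in the same order; it only reduces
the loop-control overhead per encryption.  Any effect on register use or speed
depends on the compiler and device.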

Alexander
