Date: Mon, 9 Jul 2012 06:07:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: bf_kernel.cl (was: Sayantan:Weekly Report #11)

On Thu, Jul 05, 2012 at 12:11:14PM +0530, SAYANTAN DATTA wrote:
> Therefore
> your calculation of the utilization of one out of four SIMD units per CU on
> 7970 is valid for the current kernel.

Do you know if we currently utilize one of four SIMD units or maybe
1/4th of each SIMD unit (vector width) or another combination (e.g.,
maybe 8 out of 16 vector elements in two SIMD units, leaving the other
two completely idle)?

> I once tried two john builds together on 7970. One running on LDS and the
> other on global memory but it caused an ASIC hang. Maybe we need to merge
> them together under one build. But how to merge them remains a question.
> One of the two possible ways is to call two clEnqueKernels per crypt. Or we
> can merge the two kernels. Also how the two branches get scheduled on gpu
> will impact performance.

I'd try setting WORK_GROUP_SIZE to 12 while keeping the declaration of
S_Buffer at its current size.  Introduce some new macro for this, like
LDS_GROUP_SIZE, which we'd keep at 8 for 7970.  If lid is <
LDS_GROUP_SIZE, then use the current code.  If lid is >= LDS_GROUP_SIZE
(would be 8, 9, 10, or 11 under this example), then use new code that
would use global memory instead (just modify the supplied BF_current_S
directly?)
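
To make the lid split concrete, here is a minimal sketch in plain host C
rather than OpenCL C, so it can be compiled and checked without a GPU.
WORK_GROUP_SIZE, LDS_GROUP_SIZE, S_Buffer and BF_current_S are the names
used above; uses_lds(), select_sbox() and S_WORDS are hypothetical, and
the "one S-box instance per lid" layout of the global buffer is an
assumption, not what bf_kernel.cl actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORK_GROUP_SIZE 12   /* value suggested in the message */
#define LDS_GROUP_SIZE  8    /* work-items that keep their S-boxes in local memory */
#define S_WORDS         1024 /* four Blowfish S-boxes of 256 32-bit words each */

/* Hypothetical helper: nonzero if this work-item takes the current (LDS) path. */
int uses_lds(size_t lid)
{
    assert(lid < WORK_GROUP_SIZE);
    return lid < LDS_GROUP_SIZE;
}

/*
 * Hypothetical helper: choose where a work-item's S-boxes live.
 * s_buffer stands in for the local-memory S_Buffer, which stays sized for
 * LDS_GROUP_SIZE instances.  global_s stands in for BF_current_S in global
 * memory; indexing it by lid is an assumption made for this sketch.
 */
uint32_t *select_sbox(size_t lid, uint32_t *s_buffer, uint32_t *global_s)
{
    if (uses_lds(lid))
        return s_buffer + lid * S_WORDS;  /* lid 0..7: current code path */
    return global_s + lid * S_WORDS;      /* lid 8..11: work in place in global memory */
}
```

In a real kernel the two branches would run within the same work-group,
so how they get scheduled still has to be measured on the hardware.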

Meanwhile, attached is a quick hack that uses simpler addressing modes
(maybe, depending on what the code is compiled into).  There's no
significant performance change on 7970 from this (but the source code
size is reduced), yet you might want to benchmark this more carefully.
There appears to be a 10% slowdown on GTX 570 from this, but that's with
non-optimal settings (the same as 7970's), so it might not be relevant.
You might want to play with this too.

Also, the 512-iteration loop in BF_body() could be partially unrolled -
maybe try a 2x unroll first (256 iterations of the new loop).  Maybe
this would let a few instructions of one iteration of the original loop
be intermixed with instructions from the other iteration, hiding some
latencies.
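
The shape of such a 2x unroll can be sketched in plain host C as below.
This is not the code from bf_kernel.cl: mix_step() is a made-up stand-in
for the Blowfish encryption step (it is NOT Blowfish), and fill_rolled(),
fill_unrolled2() and S_WORDS are hypothetical names.  The only point is
that 256 iterations doing two steps each write the same values in the
same order as 512 iterations doing one step each.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define S_WORDS 1024 /* 512 iterations x 2 words written per iteration */

/* Stand-in for the encryption step: a cheap data-dependent mix, not Blowfish. */
void mix_step(uint32_t *L, uint32_t *R, const uint32_t *S)
{
    uint32_t t = *L ^ S[*R & (S_WORDS - 1)];
    *L = *R + ((t << 3) | (t >> 29));
    *R = t;
}

/* Original shape: 512 iterations, one step and two stores per iteration. */
void fill_rolled(uint32_t *S)
{
    uint32_t L = 0, R = 0;
    for (int i = 0; i < 512; i++) {
        mix_step(&L, &R, S);
        S[2 * i] = L;
        S[2 * i + 1] = R;
    }
}

/* 2x unrolled shape: 256 iterations, two steps and four stores per iteration. */
void fill_unrolled2(uint32_t *S)
{
    uint32_t L = 0, R = 0;
    for (int i = 0; i < 256; i++) {
        uint32_t *p = S + 4 * i;
        mix_step(&L, &R, S);
        p[0] = L;
        p[1] = R;
        mix_step(&L, &R, S);
        p[2] = L;
        p[3] = R;
    }
}
```

Running both functions on identically initialized arrays and comparing
the results with memcmp() is a cheap way to confirm the unrolled loop is
still correct before benchmarking it.  Whether it is actually faster on
a given GPU is something only a benchmark can tell.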

Thanks,

Alexander

View attachment "john-bf_kernel-Sptr.diff" of type "text/plain" (4856 bytes)