Date: Sat, 23 Mar 2013 11:14:06 -0300
From: Claudio André <claudioandre.br@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Idea to increase plaintext length for GPU based hashes

On 19-03-2013 17:27, magnum wrote:
> On 19 Mar, 2013, at 16:16 , Claudio André <claudioandre.br@...il.com> wrote:
>> On 18-03-2013 22:21, magnum wrote:
>>> Another approach (not necessarily mutually exclusive with yours) would be to split the transfer. Let's say we have a work size of 1M. At, say, the 256K'th call to set_key(), it could initiate a transfer of this first fourth of the keys to the GPU. This transfer will not stall the host side; it will take place while we continue with the next 256K keys. And so on. If we can balance this properly, we should get rid of much of the transfer delay. Maybe we should split it into 8 or 16 parts, maybe fewer.
>> Well, it was easy to (somehow) implement your idea; see the sketch after the quoted thread. Good gain (before vs. after the split):
>> Raw:    13653K c/s real, 47953K c/s virtual
>> Raw:    17096K c/s real, 36408K c/s virtual
> I hoped for a lot more :-(
>
>> But this is not enough to do the trick. Below, I measure only the GPU part in auto-tune; I expect to be at 113M, not 17M.
>> Is set_key() the only thing to blame? BTW: auto-tune on unstable runs at 19M.
> I tried the same on ntlmv2-opencl with various split sizes, but there was no gain at all.
>
> Regarding the set_key() bottleneck:
>
> magnum@...l:src [bleeding-jumbo]$ ../run/john -t -fo:lm
> Benchmarking: LM DES [128/128 BS XOP-16]... DONE
> Raw:    60458K c/s real, 60458K c/s virtual
>
> ...if we can get 60M c/s using one CPU core, we should beat that figure for any raw GPU format if we can hide transfer latency. We need some real profiling.
>
> magnum
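
In case it is useful, here is a rough sketch of how I read the split-transfer
idea quoted above. The names (key_buf, cl_key_buf, flush_keys), the chunk
count and the (omitted) error handling are placeholders only, not the actual
format code:

#include <string.h>
#include <CL/cl.h>

#define SPLIT      8                  /* chunks per work size                */
#define KEY_SIZE   64                 /* bytes reserved per candidate        */

static char *key_buf;                 /* host-side key buffer                */
static cl_mem cl_key_buf;             /* device-side key buffer              */
static cl_command_queue queue;        /* the format's command queue          */
static size_t chunk;                  /* keys per chunk = work size / SPLIT  */
static cl_event xfer[SPLIT];
static unsigned int sent;             /* chunks already enqueued             */

static void set_key(char *key, int index)
{
    /* Assumes keys shorter than KEY_SIZE. */
    strncpy(key_buf + (size_t)index * KEY_SIZE, key, KEY_SIZE);

    /* A chunk is now complete: start its transfer, non-blocking, and keep
       filling the next chunk on the host while it is in flight. */
    if ((size_t)(index + 1) % chunk == 0) {
        size_t offset = (size_t)sent * chunk * KEY_SIZE;

        clEnqueueWriteBuffer(queue, cl_key_buf, CL_FALSE, offset,
                             chunk * KEY_SIZE, key_buf + offset,
                             0, NULL, &xfer[sent]);
        clFlush(queue);               /* make sure the transfer starts       */
        sent++;
    }
}

static void flush_keys(int count)
{
    /* Called from crypt_all(): send whatever is left of the last (possibly
       partial) chunk, then wait for all pending transfers before we launch
       the kernel. */
    size_t offset = (size_t)sent * chunk * KEY_SIZE;
    size_t left = (size_t)count * KEY_SIZE - offset;
    unsigned int i;

    if (left)
        clEnqueueWriteBuffer(queue, cl_key_buf, CL_FALSE, offset, left,
                             key_buf + offset, 0, NULL, &xfer[sent++]);
    if (sent) {
        clWaitForEvents(sent, xfer);
        for (i = 0; i < sent; i++)
            clReleaseEvent(xfer[i]);
    }
    sent = 0;
}

The point is just that each completed chunk goes out as a non-blocking write,
so the PCIe transfer overlaps with the host still calling set_key() for the
next chunk; crypt_all() only has to send the tail and wait for the events.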

My guess is that we are close to the hardware limits in 'efficiency'. On the
6770, the AMD SDK tool reports 0.85 GB/s, which is what we are getting in JtR
(with the recent ideas).
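
As a rough sanity check, if something like 64 bytes go over the bus per
candidate (that per-key size is only my guess, not a measured number), then:

          0.85 GB/s / 64 bytes per key  ~=  13M keys per second

which is right in the range of the 13-17M c/s above, so the bus itself looks
like the ceiling.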

Analysis attached.





Platform found : Advanced Micro Devices, Inc.


Device  0            Juniper
Build:               DEBUG
GPU work items:      16384
Buffer size:         33554432
CPU workers:         1
Timing loops:        20
Repeats:             1
Kernel loops:        1
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 
copyBuffer:          CL_MEM_READ_WRITE CL_MEM_USE_HOST_PTR 


LOOP ITERATIONS
---------------

Loop 0


          PCIe B/W host->device:  0.052167 s       0.64 GB/s
          PCIe B/W device->host:  0.039224 s       0.86 GB/s


Loop 1


          PCIe B/W host->device:  0.040531 s       0.83 GB/s
          PCIe B/W device->host:  0.039240 s       0.86 GB/s


Loop 2


          PCIe B/W host->device:  0.040495 s       0.83 GB/s
          PCIe B/W device->host:  0.039255 s       0.85 GB/s


Loop 3


          PCIe B/W host->device:  0.040539 s       0.83 GB/s
          PCIe B/W device->host:  0.039232 s       0.86 GB/s


Loop 4


          PCIe B/W host->device:  0.040487 s       0.83 GB/s
          PCIe B/W device->host:  0.039253 s       0.85 GB/s


Loop 5


          PCIe B/W host->device:  0.040499 s       0.83 GB/s
          PCIe B/W device->host:  0.039229 s       0.86 GB/s


Loop 6


          PCIe B/W host->device:  0.040506 s       0.83 GB/s
          PCIe B/W device->host:  0.039218 s       0.86 GB/s


Loop 7


          PCIe B/W host->device:  0.040542 s       0.83 GB/s
          PCIe B/W device->host:  0.039251 s       0.85 GB/s


Loop 8


          PCIe B/W host->device:  0.040520 s       0.83 GB/s
          PCIe B/W device->host:  0.039219 s       0.86 GB/s


Loop 9


          PCIe B/W host->device:  0.040544 s       0.83 GB/s
          PCIe B/W device->host:  0.039217 s       0.86 GB/s


Loop 10


          PCIe B/W host->device:  0.040503 s       0.83 GB/s
          PCIe B/W device->host:  0.039222 s       0.86 GB/s


Loop 11


          PCIe B/W host->device:  0.040505 s       0.83 GB/s
          PCIe B/W device->host:  0.039227 s       0.86 GB/s


Loop 12


          PCIe B/W host->device:  0.040524 s       0.83 GB/s
          PCIe B/W device->host:  0.039241 s       0.86 GB/s


Loop 13


          PCIe B/W host->device:  0.040468 s       0.83 GB/s
          PCIe B/W device->host:  0.039224 s       0.86 GB/s


Loop 14


          PCIe B/W host->device:  0.040522 s       0.83 GB/s
          PCIe B/W device->host:  0.039232 s       0.86 GB/s


Loop 15


          PCIe B/W host->device:  0.040558 s       0.83 GB/s
          PCIe B/W device->host:  0.039280 s       0.85 GB/s


Loop 16


          PCIe B/W host->device:  0.040510 s       0.83 GB/s
          PCIe B/W device->host:  0.039224 s       0.86 GB/s


Loop 17


          PCIe B/W host->device:  0.040508 s       0.83 GB/s
          PCIe B/W device->host:  0.039224 s       0.86 GB/s


Loop 18


          PCIe B/W host->device:  0.040548 s       0.83 GB/s
          PCIe B/W device->host:  0.039228 s       0.86 GB/s


Loop 19


          PCIe B/W host->device:  0.040529 s       0.83 GB/s
          PCIe B/W device->host:  0.039220 s       0.86 GB/s


AVERAGES (over loops 2 - 19, use -l for complete log)
--------

          PCIe B/W host->device:  0.040517 s       0.83 GB/s
          PCIe B/W device->host:  0.039233 s       0.86 GB/s


