john-dev - Re: MPI-openCL build; fast hashes are faster!!

Message-ID: <CA+TsHUAiu9b6Eup-t2e+2-JoXrB74-wWOY+g1AUi6=2rvcQfbA@mail.gmail.com>
Date: Thu, 28 Mar 2013 09:50:04 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: MPI-openCL build; fast hashes are faster!!

Hi,

On Thu, Mar 28, 2013 at 9:24 AM, Milen Rangelov <gat3way@...il.com> wrote:

> Yes, this approach is easier and I took that as well. Instead of trying to
> use async stuff and events just spawn several threads, each with own
> context and buffers then do the stuff in parallel on a single device. It
> wastes a lot of memory indeed, but it's easy and it works. For fast hashes
> with candidate generation on GPUs it's even better :)
>
>
> On Thu, Mar 28, 2013 at 3:45 AM, magnum <john.magnum@...hmail.com> wrote:
>
>> On 27 Mar, 2013, at 22:54 , Sayantan Datta <std2048@...il.com> wrote:
>> > OpenCL-MPI build using openmpi results in some of the fast hashes to
>> speed up by 1.5-3x without any extra modification  e.g.
>> raw-md4,raw-md5,raw-sha1 ,xsha512 and des. I haven't tested all hashes but
>> most of them works fine except the slow hashes which causes asic hangs most
>> of the time with anything more than 1 thread feeding the GPU.
>>
>> You mean with all processes using one same GPU? On what branch was that?
>> Recent improvements in raw-md4/md5 in bleeding-jumbo has reduced transfer
>> latencies a lot so that approach should be less rewarding.
>>
>> magnum
>>
>
>
Here are some benchmark results:

bleeding-jumbo:

Single thread:
sayantan@...n:~/Jtr/JohnTheRipper-bleeding-jumbo/run$ mpiexec -np 1 ./john -te -pla=1 -dev=0 -fo=raw-md4-opencl
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
Benchmarking: Raw MD4 [OpenCL (inefficient, development use only)]... Local worksize (LWS) 128, global worksize (GWS) 4194304
DONE
Raw:    61622K c/s real, 94219K c/s virtual

6 threads:
sayantan@...n:~/Jtr/JohnTheRipper-bleeding-jumbo/run$ mpiexec -np 6 ./john -te -pla=1 -dev=0 -fo=raw-md4-opencl
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
Device 1: Tahiti (AMD Radeon HD 7900 Serie)
OpenCL error (CL_MAP_FAILURE) in file (opencl_rawmd4_fmt.c) at line (87) - (Error mapping page-locked memory saved_plain)
OpenCL error (CL_MAP_FAILURE) in file (opencl_rawmd4_fmt.c) at line (87) - (Error mapping page-locked memory saved_plain)
OpenCL error (CL_MAP_FAILURE) in file (opencl_rawmd4_fmt.c) at line (87) - (Error mapping page-locked memory saved_plain)
OpenCL error (CL_MAP_FAILURE) in file (opencl_rawmd4_fmt.c) at line (87) - (Error mapping page-locked memory saved_plain)
OpenCL error (CL_MAP_FAILURE) in file (opencl_rawmd4_fmt.c) at line (87) - (Error mapping page-locked memory saved_plain)
OpenCL error (CL_MAP_FAILURE) in file (opencl_rawmd4_fmt.c) at line (87) - (Error mapping page-locked memory saved_plain)

Any explanation of why this error is occurring? As far as I know, each MPI
thread creates its own private pool of memory, so threads shouldn't be
interfering with each other's pinned memory.


On unstable-jumbo:

Single thread:
sayantan@...n:~/Jtr/JohnTheRipper-unstable-jumbo/run$ mpiexec -np 1 ./john -te -fo=raw-md4-opencl
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
Local worksize (LWS) 64, Global worksize (GWS) 2097152
Benchmarking: Raw MD4 [OpenCL (inefficient, development use only)]... DONE
Raw:    34952K c/s real, 45782K c/s virtual

6 threads:
sayantan@...n:~/Jtr/JohnTheRipper-unstable-jumbo/run$ mpiexec -np 6 ./john -te -fo=raw-md4-opencl
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Tahiti (AMD Radeon HD 7900 Serie)
Local worksize (LWS) 64, Global worksize (GWS) 2097152
Local worksize (LWS) 64, Global worksize (GWS) 2097152
Benchmarking: Raw MD4 [OpenCL (inefficient, development use only)]...
(6xMPI) Local worksize (LWS) 64, Global worksize (GWS) 2097152
Local worksize (LWS) 256, Global worksize (GWS) 2097152
Local worksize (LWS) 256, Global worksize (GWS) 2097152
Local worksize (LWS) 256, Global worksize (GWS) 2097152
DONE
Raw:    140853K c/s real, 262144K c/s virtual

This is about a 4x jump (140853K vs. 34952K c/s real) compared to
single-threaded unstable and roughly 2.3x (vs. 61622K c/s real) compared to
single-threaded bleeding. :)
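
The ratios can be checked directly from the "real" c/s figures in the
benchmark output above. A minimal sketch in Python; the constant and function
names are illustrative only and are not part of John the Ripper:

```python
# "real" c/s figures (in thousands) copied from the benchmark output above.
BLEEDING_1_PROC = 61622    # bleeding-jumbo, mpiexec -np 1
UNSTABLE_1_PROC = 34952    # unstable-jumbo, mpiexec -np 1
UNSTABLE_6_PROC = 140853   # unstable-jumbo, mpiexec -np 6


def speedup(new_rate, base_rate):
    """Return how many times faster new_rate is than base_rate."""
    return new_rate / base_rate


vs_unstable = speedup(UNSTABLE_6_PROC, UNSTABLE_1_PROC)
vs_bleeding = speedup(UNSTABLE_6_PROC, BLEEDING_1_PROC)

print(f"6 procs vs. 1 proc, unstable-jumbo:    {vs_unstable:.2f}x")
print(f"6 procs (unstable) vs. 1 proc bleeding: {vs_bleeding:.2f}x")
```

Only the "real" figures are compared here; the "virtual" figures are left out
of the comparison.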

Regards,
Sayantan
