john-dev - Intel OpenCL on CPU and MIC

Message-ID: <20130808213254.GB27069@openwall.com>
Date: Fri, 9 Aug 2013 01:32:54 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Intel OpenCL on CPU and MIC

magnum, all -

Attached are two logs produced by these commands:

./john --list=formats | tr ', ' '\n' | grep -v '^$' | while read f; do
	echo "=== Testing $f on CPU ==="
	./john -te -form=$f
	echo "Exit code: $?"
done 2>&1 | tee d0a.log

./john --list=formats | tr ', ' '\n' | grep -v '^$' | while read f; do
	echo "=== Testing $f on MIC ==="
	./john -te -form=$f -dev=1
	echo "Exit code: $?"
done 2>&1 | tee d1a.log

with bleeding-jumbo from last Friday, on a machine with 2xE5-2643
(8 cores, 16 logical CPUs total, 3.3 GHz) and Xeon Phi 3115A (57 cores,
1.1 GHz).  The build was with linux-x86-64-opencl using gcc, with OpenMP
enabled and only one change to the sources:

In common-opencl.h, I changed:

#define OPENCLBUILDOPTIONS "-cl-strict-aliasing -cl-mad-enable"

to:

#define OPENCLBUILDOPTIONS "-cl-mad-enable"

because -cl-strict-aliasing was not understood by (and resulted in
errors from) Intel's OpenCL compiler.  (Should we make this change
standard?)
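
If dropping the option for everyone turns out to be undesirable, one
alternative would be to pick the options per platform at run time.  Below
is only a rough sketch of that idea: pick_build_options() is a
hypothetical helper, not something that exists in JtR, and it assumes the
vendor string (e.g. as reported for CL_PLATFORM_VENDOR) contains "Intel"
on Intel's platform.

```c
#include <string.h>

/* Hypothetical helper, not part of JtR: choose OpenCL build options
 * from a platform vendor string.  Assumes that Intel's platform
 * reports a vendor string containing "Intel". */
static const char *pick_build_options(const char *vendor)
{
	/* Intel's compiler rejected -cl-strict-aliasing, so omit it there */
	if (vendor && strstr(vendor, "Intel"))
		return "-cl-mad-enable";

	/* Everyone else keeps the current default */
	return "-cl-strict-aliasing -cl-mad-enable";
}
```

The returned string would then be passed wherever OPENCLBUILDOPTIONS is
used now; whether that is worth the extra complexity over simply
removing the option is a separate question.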

I gzipped d1a.log because it was large.

As you can see, things are not good: our OpenCL formats either fail or
give poor speeds.  Our regular C code is faster.

The errors and poor speeds seen on CPU you may reproduce on "well".
It currently has the same version of Intel OpenCL SDK:

intel_sdk_for_ocl_applications_2013_xe_sdk_3.0.67279_x64.tgz

For another test, I tried tuning MULTIPLIER in opencl_bf_std.h.  I got
best speed with:

#define MULTIPLIER			(1024*64)

and it was:

[solar@...tos6 src]$ GWS=65536 ../run/john -te -form=bcrypt-opencl -dev=1
get_device_type = 8 (CPU would be 2)
Device 1: Intel(R) Many Integrated Core Acceleration Card
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Raw:    6124 c/s real, 1092K c/s virtual

(the explicit GWS is just to speed up testing, and the "get_device_type =
8 (CPU would be 2)" line is my extra debugging output).

As you can see, this is also not a good speed - it's CPU-like, and is
low for an accelerator card.  I am fairly certain that the kernel was
not vectorized, which would explain the poor speed.  I am getting
roughly the same cumulative speed by building the whole of JtR natively
for Xeon Phi and running it there.  No OpenMP that way for a subtle
reason that I can explain separately, but I did run with --fork=228, for
a cumulative speed at bcrypt of around 6000 c/s.

For descrypt with 512-bit MIC intrinsics, I am getting a cumulative
speed of around 67M c/s for the 228 processes.  Unfortunately, building
descrypt-opencl for MIC failed (see the log), so I can't compare these.
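
For a rough sense of scale, the cumulative figures can be broken down
per process.  This is just arithmetic on the numbers quoted above; the
helper names are made up for illustration, and the 4 hardware threads
per core figure (57 * 4 = 228, presumably where the fork count comes
from) is an assumption not stated in this message.

```c
/* Assumption: 4 hardware threads per Xeon Phi core (not stated in the
 * message itself). */
static int mic_hw_threads(int cores, int threads_per_core)
{
	return cores * threads_per_core;
}

/* Average speed of one forked process, given the cumulative speed. */
static double per_process_speed(double cumulative_cs, int processes)
{
	return cumulative_cs / processes;
}
```

With the figures above, mic_hw_threads(57, 4) is 228,
per_process_speed(6000, 228) comes to a bit over 26 c/s for bcrypt, and
per_process_speed(67e6, 228) to a bit under 294K c/s for descrypt.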

The MIC intrinsics for DES_bs_b.c are:

#elif defined(__MIC__) && DES_BS_DEPTH == 512
#include <immintrin.h>

typedef __m512i vtype;

#define vst(dst, ofs, src) \
	_mm512_store_epi32((vtype *)((DES_bs_vector *)&(dst) + (ofs)), (src))

#define vxorf(a, b) \
	_mm512_xor_epi32((a), (b))

#define vand(dst, a, b) \
	(dst) = _mm512_and_epi32((a), (b))
#define vor(dst, a, b) \
	(dst) = _mm512_or_epi32((a), (b))
#define vandn(dst, a, b) \
	(dst) = _mm512_andnot_epi32((b), (a))
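
Note the swapped operands in vandn: the andnot intrinsic complements its
first argument, so passing (b), (a) gives a & ~b.  The following scalar
model on a single 32-bit lane is only an illustration of that semantics;
the s_* names and scalar_andnot() are made up for this sketch and are not
in DES_bs_b.c.

```c
#include <stdint.h>

/* Scalar stand-ins for the vector macros, one 32-bit lane only. */
typedef uint32_t lane_t;

/* Mimics the andnot intrinsic: complement the FIRST argument, then AND. */
static lane_t scalar_andnot(lane_t x, lane_t y)
{
	return ~x & y;
}

#define s_vxorf(a, b) \
	((a) ^ (b))
#define s_vand(dst, a, b) \
	(dst) = (a) & (b)
#define s_vor(dst, a, b) \
	(dst) = (a) | (b)
/* Operands swapped on purpose, as in the MIC macro: result is a & ~b */
#define s_vandn(dst, a, b) \
	(dst) = scalar_andnot((b), (a))
```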

Alexander

View attachment "d0a.log" of type "text/plain" (56159 bytes)

Download attachment "d1a.log.gz" of type "application/x-gzip" (42839 bytes)