Message-ID: <20130808213254.GB27069@openwall.com>
Date: Fri, 9 Aug 2013 01:32:54 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Intel OpenCL on CPU and MIC
magnum, all -
Attached are two logs produced by these commands:

./john --list=formats | tr ', ' '\n' | grep -v '^$' |
while read f; do
	echo "=== Testing $f on CPU ==="
	./john -te -form=$f
	echo "Exit code: $?"
done 2>&1 | tee d0a.log

./john --list=formats | tr ', ' '\n' | grep -v '^$' |
while read f; do
	echo "=== Testing $f on MIC ==="
	./john -te -form=$f -dev=1
	echo "Exit code: $?"
done 2>&1 | tee d1a.log
with bleeding-jumbo from last Friday, on a machine with 2xE5-2643
(8 cores, 16 logical CPUs total, 3.3 GHz) and Xeon Phi 3115A (57 cores,
1.1 GHz). The build was with linux-x86-64-opencl using gcc, with OpenMP
enabled and only one change to the sources:
In common-opencl.h, I changed:
#define OPENCLBUILDOPTIONS "-cl-strict-aliasing -cl-mad-enable"
to:
#define OPENCLBUILDOPTIONS "-cl-mad-enable"
because -cl-strict-aliasing was not understood by (and resulted in
errors from) Intel's OpenCL compiler. (Should we make this change
standard?)
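(If we do make it standard, one hedged option, sketched below with a hypothetical helper rather than jumbo's actual code, is to keep -cl-strict-aliasing for compilers that accept it and drop it only where it is known to fail. In real code the vendor string would come from clGetPlatformInfo() with CL_PLATFORM_VENDOR; here it is just a parameter so the logic stands alone:)

```c
#include <string.h>

/*
 * Hypothetical sketch (not jumbo's actual code): choose OpenCL build
 * options per platform, keeping -cl-strict-aliasing only where the
 * compiler accepts it.
 */
static const char *opencl_build_options(const char *platform_vendor)
{
	/* Intel's OpenCL compiler errors out on -cl-strict-aliasing */
	if (strstr(platform_vendor, "Intel"))
		return "-cl-mad-enable";
	return "-cl-strict-aliasing -cl-mad-enable";
}
```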
I gzipped d1a.log because it was large.
As you can see, things are not good: our OpenCL formats either fail or
give poor speeds. Our regular C code is faster.
You may reproduce the errors and poor speeds seen on CPU on "well",
which currently has the same version of the Intel OpenCL SDK:
intel_sdk_for_ocl_applications_2013_xe_sdk_3.0.67279_x64.tgz
For another test, I tried tuning MULTIPLIER in opencl_bf_std.h. I got
best speed with:
#define MULTIPLIER (1024*64)
and it was:
[solar@...tos6 src]$ GWS=65536 ../run/john -te -form=bcrypt-opencl -dev=1
get_device_type = 8 (CPU would be 2)
Device 1: Intel(R) Many Integrated Core Acceleration Card
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Raw: 6124 c/s real, 1092K c/s virtual
(The explicit GWS=65536, matching the MULTIPLIER above, is just to
speed up testing, and the "get_device_type = 8 (CPU would be 2)" line
is my extra debugging output.)
As you can see, this is also not a good speed - it's CPU-like, and is
low for an accelerator card. I am fairly certain that the kernel was
not vectorized, which explains the poor speed. I am getting roughly the
same cumulative speed by building the whole JtR right for Xeon Phi and
running it there. There's no OpenMP that way, for a subtle reason that
I can explain separately, but I did run with --fork=228 (57 cores x 4
hardware threads), for a cumulative bcrypt speed of around 6000 c/s.
For descrypt with 512-bit MIC intrinsics, I am getting a cumulative
speed of around 67M c/s for the 228 processes. Unfortunately, building
descrypt-opencl for MIC failed (see the log), so I can't compare these.
The MIC intrinsics for DES_bs_b.c are:
#elif defined(__MIC__) && DES_BS_DEPTH == 512
#include <immintrin.h>
typedef __m512i vtype;
#define vst(dst, ofs, src) \
	_mm512_store_epi32((vtype *)((DES_bs_vector *)&(dst) + (ofs)), (src))
#define vxorf(a, b) \
	_mm512_xor_epi32((a), (b))
#define vand(dst, a, b) \
	(dst) = _mm512_and_epi32((a), (b))
#define vor(dst, a, b) \
	(dst) = _mm512_or_epi32((a), (b))
#define vandn(dst, a, b) \
	(dst) = _mm512_andnot_epi32((b), (a))
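(For anyone unfamiliar with the bitslice layout these macros plug into, here's a minimal portable sketch: the same macro interface with a plain 64-bit word as the "vector" type, i.e. DES_BS_DEPTH == 64, so each logical operation evaluates one gate for 64 independent instances at once instead of 512. The demo function is hypothetical; jumbo's actual scalar fallback in DES_bs_b.c is only equivalent in spirit.)

```c
#include <stdint.h>

/*
 * Portable sketch analogous to the MIC block above: vtype is a 64-bit
 * word, so one logical operation processes 64 bitsliced instances,
 * one per bit lane.
 */
typedef uint64_t vtype;

#define vxorf(a, b)      ((a) ^ (b))
#define vand(dst, a, b)  ((dst) = (a) & (b))
#define vor(dst, a, b)   ((dst) = (a) | (b))
/* matches _mm512_andnot_epi32((b), (a)) above: a & ~b */
#define vandn(dst, a, b) ((dst) = (a) & ~(b))

/*
 * Hypothetical demo of one AND gate of a bitsliced circuit: a single
 * machine instruction evaluates the gate for all 64 lanes.
 */
vtype bs_and_demo(vtype a, vtype b)
{
	vtype r;
	vand(r, a, b);
	return r;
}
```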
Alexander
View attachment "d0a.log" of type "text/plain" (56159 bytes)
Download attachment "d1a.log.gz" of type "application/x-gzip" (42839 bytes)