Date: Fri, 9 Aug 2013 01:32:54 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Intel OpenCL on CPU and MIC

magnum, all -

Attached are two logs produced by these commands:

./john --list=formats | tr ', ' '\n' | grep -v '^$' |
while read f; do
	echo "=== Testing $f on CPU ==="
	./john -te -form=$f
	echo "Exit code: $?"
done 2>&1 | tee d0a.log

./john --list=formats | tr ', ' '\n' | grep -v '^$' |
while read f; do
	echo "=== Testing $f on MIC ==="
	./john -te -form=$f -dev=1
	echo "Exit code: $?"
done 2>&1 | tee d1a.log

with bleeding-jumbo from last Friday, on a machine with 2xE5-2643 (8 cores,
16 logical CPUs total, 3.3 GHz) and Xeon Phi 3115A (57 cores, 1.1 GHz).

The build was with linux-x86-64-opencl using gcc, with OpenMP enabled and
only one change to the sources: in common-opencl.h, I changed:

#define OPENCLBUILDOPTIONS "-cl-strict-aliasing -cl-mad-enable"

to:

#define OPENCLBUILDOPTIONS "-cl-mad-enable"

because -cl-strict-aliasing was not understood by (and resulted in errors
from) Intel's OpenCL compiler.  (Should we make this change standard?)

I gzipped d1a.log because it was large.

As you can see, things are not good: our OpenCL formats either fail or give
poor speeds.  Our regular C code is faster.

The errors and poor speeds seen on CPU you may reproduce on "well".  It
currently has the same version of Intel OpenCL SDK:

intel_sdk_for_ocl_applications_2013_xe_sdk_3.0.67279_x64.tgz

For another test, I tried tuning MULTIPLIER in opencl_bf_std.h.  I got best
speed with:

#define MULTIPLIER (1024*64)

and it was:

[solar@...tos6 src]$ GWS=65536 ../run/john -te -form=bcrypt-opencl -dev=1
get_device_type = 8 (CPU would be 2)
Device 1: Intel(R) Many Integrated Core Acceleration Card
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Raw:	6124 c/s real, 1092K c/s virtual

(The explicit GWS is just to speed up testing, and the "get_device_type = 8
(CPU would be 2)" line is my extra debugging output.)
As you can see, this is also not a good speed - it's CPU-like, and is low
for an accelerator card.  I am fairly certain that the kernel was not
vectorized, which is why the speed is poor.

I am getting roughly the same cumulative speed by building the whole JtR
right for Xeon Phi and running it there.  No OpenMP that way, for a subtle
reason that I can explain separately, but I did run with --fork=228 (the 57
cores support 4 hardware threads each), for a cumulative speed at bcrypt of
around 6000 c/s.  For descrypt with 512-bit MIC intrinsics, I am getting a
cumulative speed of around 67M c/s for the 228 processes.  Unfortunately,
building descrypt-opencl for MIC failed (see the log), so I can't compare
these.

The MIC intrinsics for DES_bs_b.c are:

#elif defined(__MIC__) && DES_BS_DEPTH == 512
#include <immintrin.h>

typedef __m512i vtype;

#define vst(dst, ofs, src) \
	_mm512_store_epi32((vtype *)((DES_bs_vector *)&(dst) + (ofs)), (src))

#define vxorf(a, b) \
	_mm512_xor_epi32((a), (b))

#define vand(dst, a, b) \
	(dst) = _mm512_and_epi32((a), (b))

#define vor(dst, a, b) \
	(dst) = _mm512_or_epi32((a), (b))

#define vandn(dst, a, b) \
	(dst) = _mm512_andnot_epi32((b), (a))

Alexander

View attachment "d0a.log" of type "text/plain" (56159 bytes)
Download attachment "d1a.log.gz" of type "application/x-gzip" (42839 bytes)