john-dev - OpenCL for CPU (AMD vs. Intel)

Message-ID: <cbb3706666d30d0b31c93dd9644fe162@smtp.hushmail.com>
Date: Sun, 26 Feb 2012 14:31:09 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: OpenCL for CPU (AMD vs. Intel)

On 02/23/2012 09:54 PM, magnum wrote:
> I had a look at Intel's OpenCL SDK today. This one makes your Intel CPU
> (or better, all CPUs and cores) work just like a GPU with OpenCL.

Intel's OpenCL SDK works fine with an AMD CPU and vice versa, and there's
no problem installing both. Armed with the new --platform option in the
git version of Jumbo, I can now do some comparisons.

Here's the output of --platform=LIST with both SDKs installed:

$ ../run/john -platform=LIST
Platform #0 name: Intel(R) OpenCL
Platform version: OpenCL 1.1 LINUX
	Device #0 name:		Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz
	Device vendor:		Intel(R) Corporation
	Device type:		CPU
	Device version:		OpenCL 1.1 (Build 15293.6649)
	Driver version:		1.1
	Global Memory:		3855 MB
	Global Memory Cache:	3 MB
	Local Memory:		32 KB
	Max clock (MHz) :	2400
	Max Work Group Size:	1024
	Parallel compute cores:	2

Platform #1 name: AMD Accelerated Parallel Processing
Platform version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
	Device #0 name:		Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz
	Device vendor:		GenuineIntel
	Device type:		CPU
	Device version:		OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
	Driver version:		2.0
	Global Memory:		3855 MB
	Global Memory Cache:	0 MB
	Local Memory:		32 KB
	Max clock (MHz) :	2401
	Max Work Group Size:	1024
	Parallel compute cores:	2

So, let's try the phpass format:

$ ../run/john -test -form:phpass-opencl -platform=0
OpenCL Platforms: 2
OpenCL Platform: <<<Intel(R) OpenCL>>> 1 device(s), using device:
<<<Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz>>>
Compilation log: Build started
Kernel <phpass> was successfully vectorized
Done.
Optimal Group work Size = 64
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... DONE
Raw:	17226 c/s real, 8613 c/s virtual

$ ../run/john -test -form:phpass-opencl -platform=1
OpenCL Platforms: 2
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 1 device(s),
using device: <<<Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz>>>
Optimal Group work Size = 8
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... DONE
Raw:	6063 c/s real, 3051 c/s virtual

Intel's compiler says it vectorized phpass, and it's indeed almost three
times faster than running under AMD. I'm not sure whether AMD reports if
or when it vectorizes. I'm not even sure it ever auto-vectorizes code.

$ ../run/john -test -form:cryptmd5-opencl
OpenCL Platforms: 2
OpenCL Platform: <<<Intel(R) OpenCL>>> 1 device(s), using device:
<<<Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz>>>
Compilation log: Build started
Kernel <cryptmd5> was not vectorized
Done.
Benchmarking: CRYPTMD5-OPENCL [MD5-based CRYPT]... DONE
Raw:	8694 c/s real, 4306 c/s virtual

$ ../run/john -test -form:cryptmd5-opencl -platform=1
OpenCL Platforms: 2
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 1 device(s),
using device: <<<Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz>>>
Benchmarking: CRYPTMD5-OPENCL [MD5-based CRYPT]... DONE
Raw:	9404 c/s real, 4714 c/s virtual

Here AMD beats Intel but both are slower than even the generic 32-bit code.

So let's try SHA-1:

$ ../run/john -test -form:mysql-sha1-opencl
OpenCL Platforms: 2
OpenCL Platform: <<<Intel(R) OpenCL>>> 1 device(s), using device:
<<<Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz>>>
Compilation log: Build started
Kernel <sha1_crypt_kernel> was not vectorized
Done.
Max Group Work Size 1024 Optimal local work size 128
(to avoid this test on next run do export LWS=128)
Local work size (LWS) 128, Keys per crypt (KPC) 2097152
Benchmarking: MySQL 4.1 double-SHA-1 [mysql-sha1-opencl]... DONE
Many salts:	1923K c/s real, 1069K c/s virtual
Only one salt:	1923K c/s real, 1064K c/s virtual

$ ../run/john -test -form:mysql-sha1-opencl -platform=1
OpenCL Platforms: 2
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 1 device(s),
using device: <<<Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz>>>
Max Group Work Size 1024 Optimal local work size 1024
(to avoid this test on next run do export LWS=1024)
Local work size (LWS) 1024, Keys per crypt (KPC) 2097152
Benchmarking: MySQL 4.1 double-SHA-1 [mysql-sha1-opencl]... DONE
Many salts:	2995K c/s real, 1638K c/s virtual
Only one salt:	2853K c/s real, 1638K c/s virtual

AMD wins again. The AMD figure is better than generic 32-bit code but
slower (per core) than generic 64-bit.
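
For a quick side-by-side, the "real" c/s figures quoted above can be
turned into AMD-to-Intel ratios. This is only arithmetic on the numbers
already in this message; the dictionary layout and names are made up for
illustration.

```python
# "real" c/s figures copied from the benchmark output above.
# mysql-sha1 uses the "Many salts" line, in thousands of c/s.
results = {
    "phpass":     {"intel": 17226, "amd": 6063},
    "cryptmd5":   {"intel": 8694,  "amd": 9404},
    "mysql-sha1": {"intel": 1923,  "amd": 2995},
}

# Ratio > 1 means the AMD platform was faster on this CPU.
ratios = {name: r["amd"] / r["intel"] for name, r in results.items()}

for name, ratio in ratios.items():
    print(f"{name:12s} AMD/Intel = {ratio:.2f}")
```

Only phpass, the one kernel Intel reported as vectorized, comes out
below 1.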

Figures might be a little better with an optimised keys-per-crypt (KPC),
but that test would take an awful lot of time with the current scheme.
Maybe the default KPC should depend on the reported number of cores. And
maybe the auto KPC should do some kind of binary search to home in: start
low, double until it's slower, back off half the last increment, and so on.
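
A minimal sketch of that search, in Python rather than John's C and not
taken from the Jumbo source. The names tune_kpc, benchmark, max_kpc and
min_step are hypothetical; benchmark(kpc) stands in for whatever timing
run returns c/s for a given KPC. It assumes throughput has a single peak
as KPC grows, which real measurements may not satisfy.

```python
def tune_kpc(benchmark, start=1024, max_kpc=1 << 22, min_step=1024):
    """Return (kpc, speed) for the best keys-per-crypt value found.

    benchmark: hypothetical callable mapping a KPC value to a speed in c/s.
    """
    best_kpc, best_speed = start, benchmark(start)

    # Phase 1: start low and keep doubling until it gets slower
    # (or until the next doubling would exceed the cap).
    while best_kpc * 2 <= max_kpc:
        speed = benchmark(best_kpc * 2)
        if speed <= best_speed:
            break
        best_kpc, best_speed = best_kpc * 2, speed

    # Phase 2: the last increment was best_kpc, so back off by half of it
    # and keep halving the step, probing on either side of the best value.
    step = best_kpc // 2
    while step >= max(min_step, 1):
        for candidate in (best_kpc + step, best_kpc - step):
            if 1 <= candidate <= max_kpc:
                speed = benchmark(candidate)
                if speed > best_speed:
                    best_kpc, best_speed = candidate, speed
                    break
        step //= 2

    return best_kpc, best_speed


# Usage with a synthetic, single-peaked speed curve (not real data).
kpc, speed = tune_kpc(lambda k: -(k - 40960) ** 2)
print(kpc)
```

Each probe costs one benchmark run, so the number of runs grows with the
logarithm of the KPC range instead of with the range itself. Seeding
start from the reported core count would be one way to cover the
"default depends on cores" idea, but that part is left out here.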

magnum