Date: Fri, 14 Aug 2015 16:44:09 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

2015-08-14 16:37 GMT+02:00 Agnieszka Bielec <bielecagnieszka8@...il.com>:
> 2015-08-14 15:31 GMT+02:00 Solar Designer <solar@...nwall.com>:
>> On Thu, Aug 13, 2015 at 12:28:57AM +0200, magnum wrote:
>>> On 2015-08-12 23:51, Solar Designer wrote:
>>> >magnum, do you have an explanation why the best benchmark result during
>>> >auto-tuning is usually substantially different from the final benchmark
>>> >in most of Agnieszka's formats?  I'm fine with eventually dismissing it
>>> >as "hard to achieve" and "cosmetic anyway", but I'd like to understand
>>> >the cause first.  Thanks!
>>>
>>> Generally a mismatch could be caused by using different [cost] test
>>> vectors in auto-tune than the ones benchmarked, or auto-tune using just
>>> one repeated plaintext in a format where length matters for speed (eg.
>>> RAR), or something along those lines.
>>>
>>> Another reason would be incorrect setup of autotune for split kernels.
>>> For example, if auto-tune thinks we're going to call a split kernel 500
>>> times but the real run does it 1000 times, we'll see inflated figures
>>> from autotune.
>>>
>>> A third reason (seen in early WPA-PSK) is when crypt_all() does
>>> significant post-processing on CPU where auto-tune doesn't.
>>
>> At least the first reason you listed may likely result in suboptimal
>> auto-tuning.  Perhaps it wouldn't with simple iterated schemes like
>> PBKDF2, but with memory-hard schemes like Argon2 the cost settings do
>> affect optimal LWS and GWS substantially.
>>
>> So we shouldn't dismiss this without understanding of what exactly is
>> going on in a given case.
>
>
> In cracking mode on my laptop, argon2d starts at the same speed as the
> one shown while computing GWS. After some time the speed gets close to
> the one shown by --test, but it's not exactly the same.
>
> beginning
> 0g 0:00:00:05 13.67% 2/3 (ETA: 16:00:32) 0g/s 3922p/s 3922c/s 3922C/s
> GPU:56°C util:99% leugim..nolfet
>
> after 1 min
> 0g 0:00:03:25  3/3 0g/s 4067p/s 4067c/s 4067C/s GPU:77°C util:99% 213160..241144
>
> after 5 min
> 0g 0:00:07:40  3/3 0g/s 4083p/s 4083c/s 4083C/s GPU:78°C util:45%
> critas01..crachera
>
> --test
>
> Local worksize (LWS) 64, global worksize (GWS) 512
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
> Many salts:     4114 c/s real, 4077 c/s virtual
> Only one salt:  4114 c/s real, 4114 c/s virtual
>
> I don't see big differences with argon2i on my laptop
>
> on super:
>
> [a@...er run]$ ./john --test --format=argon2i-opencl --v=4
> Benchmarking: argon2i-opencl [Blake2 OpenCL]...
> memory per hash : 1.46 MB
> Device 0: Tahiti [AMD Radeon HD 7900 Series]
> Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
> -DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
> -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
> Calculating best global worksize (GWS); max. 1s single kernel invocation.
> gws:       256         385 c/s         385 rounds/s 663.846ms per crypt_all()!
> gws:       512         719 c/s         719 rounds/s 711.475ms per crypt_all()+
> gws:      1024        1298 c/s        1298 rounds/s 788.748ms per crypt_all()+
> Local worksize (LWS) 64, global worksize (GWS) 1024
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
> Many salts:     390 c/s real, 102400 c/s virtual
> Only one salt:  390 c/s real, 102400 c/s virtual
>
> cracking run shows
>
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:21 6.61% 2/3 (ETA: 17:03:04) 0g/s 385.3p/s 385.3c/s
> 385.3C/s fireballs..bens
> GPU 0 overheat (33816176°C, fan 0%), aborting job.
> 0g 0:00:00:21 6.61% 2/3 (ETA: 17:03:04) 0g/s 384.0p/s 384.0c/s
> 384.0C/s fireballs..bens
>
> so the speeds reported by the main --test benchmark are right

wtf?

[a@...er run]$ ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts:     423 c/s real, 102400 c/s virtual
Only one salt:  423 c/s real, 102400 c/s virtual

[a@...er run]$ ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:       256         387 c/s         387 rounds/s 659.830ms per crypt_all()!
gws:       512         720 c/s         720 rounds/s 710.817ms per crypt_all()+
gws:      1024        1305 c/s        1305 rounds/s 784.470ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts:     389 c/s real, 102400 c/s virtual
Only one salt:  386 c/s real, 51200 c/s virtual

[a@...er run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.46 MB
Device 0: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
Many salts:     1296 c/s real, 204800 c/s virtual
Only one salt:  1304 c/s real, 204800 c/s virtual
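
For reference, the c/s column in the auto-tune output above is consistent with GWS divided by the duration of one crypt_all() call. Below is a minimal Python sketch of that arithmetic, using only figures printed in the runs above. The function and variable names are made up for illustration and are not part of john.

```python
# Figures copied from the second --v=4 run above:
# (GWS candidate, milliseconds per crypt_all() call)
autotune = [(256, 659.830), (512, 710.817), (1024, 784.470)]


def candidates_per_second(gws, ms_per_call):
    """Throughput implied by one crypt_all() call that handles gws candidates."""
    return gws / (ms_per_call / 1000.0)


def implied_ms_per_call(gws, cps):
    """Duration of one crypt_all() call implied by a reported c/s figure."""
    return gws / cps * 1000.0


for gws, ms in autotune:
    print(f"gws {gws:5d}: about {candidates_per_second(gws, ms):.1f} c/s")

benchmark_cps = 389  # "Many salts" real c/s after auto-tune picked GWS=1024
forced_cps = 1296    # same figure when GWS=1024 was set in the environment

best_autotune_cps = candidates_per_second(1024, 784.470)
print(f"benchmark implies about {implied_ms_per_call(1024, benchmark_cps):.0f} ms per call")
print(f"forced GWS implies about {implied_ms_per_call(1024, forced_cps):.0f} ms per call")
print(f"auto-tune / benchmark ratio: about {best_autotune_cps / benchmark_cps:.2f}")
```

By this arithmetic, 389 c/s at GWS=1024 corresponds to roughly 2.6 s per crypt_all() call, against the 784 ms measured during auto-tuning, while the run with GWS=1024 forced stays close to the auto-tune figure.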

This may have something to do with MEM_SIZE/4 (I have now removed the /4):
http://www.openwall.com/lists/john-dev/2015/08/06/22
Sorry, I couldn't find my original e-mail.
