Date: Fri, 21 Aug 2015 17:40:42 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

2015-08-20 22:34 GMT+02:00 Solar Designer <solar@...nwall.com>:
> Also, we're optimizing this blindfolded, and that's wrong.  We should be
> reviewing the generated code.  You may patch common-opencl.c:
> opencl_build_kernel_opt() to invoke opencl_build() like this:
>
>         opencl_build(sequential_id, opts, 1, "kernel.out");
>
> instead of the current:
>
>         opencl_build(sequential_id, opts, 0, NULL);
>
> Then when targeting NVIDIA cards it dumps PTX assembly to the filename
> specified there.  It looks something like this, just much larger:
>
> http://arrayfire.com/demystifying-ptx-code/
>
> You could start by experimenting with a much simpler than Argon2 yet in
> some ways similar kernel: implement some trivial operation like XOR on
> different vector widths and see whether/how this changes the assembly.
> Then make it slightly less trivial (just enough to prevent the compiler
> from optimizing things out) and add uses of private or local memory,
> and see if you can make it run faster by using wider vectors per the
> same private or local memory usage.
>

I tested (on the 960M only):
- copying memory from __private to __private
- copying from __global to __private
- XORing __private tables with __private tables

using ulong, ulong2, ulong4 and ulong8 (with ulong16 I was getting an empty kernel).

In the generated PTX code, ulong4 and ulong8 were changed to ulong2,

something like this (ulong4):

    ld.global.v2.u64     {%rd73, %rd74}, [%rd926+8000];
    ld.global.v2.u64     {%rd77, %rd78}, [%rd926+8016];
    st.local.v2.u64     [%rd937+208], {%rd77, %rd78};
    st.local.v2.u64     [%rd937+192], {%rd73, %rd74};

I was getting the best speed with ulong (except when copying from global to private).


speeds:

xoring:

//1
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     1312 c/s real, 1312 c/s virtual
Only one salt:  1301 c/s real, 1312 c/s virtual
//2
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     590 c/s real, 590 c/s virtual
Only one salt:  595 c/s real, 595 c/s virtual

//4
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     914 c/s real, 914 c/s virtual
Only one salt:  906 c/s real, 898 c/s virtual
//8
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     738 c/s real, 731 c/s virtual
Only one salt:  738 c/s real, 738 c/s virtual

copying from global:

//1
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     853 c/s real, 860 c/s virtual
Only one salt:  860 c/s real, 860 c/s virtual

//2
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     1174 c/s real, 1185 c/s virtual
Only one salt:  1174 c/s real, 1163 c/s virtual

//4,8
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     1122 c/s real, 1113 c/s virtual
Only one salt:  1132 c/s real, 1132 c/s virtual


copying from private:

//1
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     2258 c/s real, 2236 c/s virtual
Only one salt:  2258 c/s real, 2258 c/s virtual

//2
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     685 c/s real, 679 c/s virtual
Only one salt:  685 c/s real, 691 c/s virtual

//4
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     1153 c/s real, 1142 c/s virtual
Only one salt:  1163 c/s real, 1163 c/s virtual

//8
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 2, cost 2 (m) of 2
Many salts:     1796 c/s real, 1796 c/s virtual
Only one salt:  1812 c/s real, 1812 c/s virtual
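
To make the comparison easier to read, here is a small C sketch that tabulates the "Many salts" real figures above and reports each column's speed relative to //1. It assumes that //1, //2, //4 and //8 stand for ulong, ulong2, ulong4 and ulong8, and it enters the "//4,8" global-copy result for both of those columns. The array and function names are made up for illustration; this is not part of the benchmarked code.

```c
#include <stdio.h>

/* "Many salts" real c/s figures copied from this message.
 * Rows: test. Columns: //1, //2, //4, //8 (assumed to mean
 * ulong, ulong2, ulong4, ulong8). The single "//4,8" result for
 * copying from global is entered in both of its columns. */
static const char *const test_name[3] = {
    "xoring", "copying from global", "copying from private"
};
static const int width[4] = { 1, 2, 4, 8 };
static const double speed[3][4] = {
    { 1312.0,  590.0,  914.0,  738.0 },
    {  853.0, 1174.0, 1122.0, 1122.0 },
    { 2258.0,  685.0, 1153.0, 1796.0 },
};

/* Speed of column w relative to the //1 column of test t. */
double relative_speed(int t, int w)
{
    return speed[t][w] / speed[t][0];
}

/* Index of the fastest column for test t. */
int fastest_width(int t)
{
    int best = 0;
    for (int w = 1; w < 4; w++)
        if (speed[t][w] > speed[t][best])
            best = w;
    return best;
}

/* Print one block per test: raw speed, ratio to //1, fastest column. */
void print_summary(void)
{
    for (int t = 0; t < 3; t++) {
        printf("%s:\n", test_name[t]);
        for (int w = 0; w < 4; w++)
            printf("  //%d  %6.0f c/s  %.2fx\n",
                   width[w], speed[t][w], relative_speed(t, w));
        printf("  fastest: //%d\n", width[fastest_width(t)]);
    }
}
```

Calling print_summary() from a main() prints the three tables. The ratios only compare the numbers reported here for one GPU and one LWS/GWS setting, so they should not be read as general guidance on vector widths.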
