john-dev - Re: LWS and GWS auto-tuning

Message-ID: <20150826082411.GA4835@openwall.com>
Date: Wed, 26 Aug 2015 11:24:11 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Wed, Aug 26, 2015 at 11:06:17AM +0300, Solar Designer wrote:
> On Tue, Aug 25, 2015 at 08:36:44PM +0200, magnum wrote:
> > Worst/best 10 for Tahiti (oldoffice failing):
> 
> > Ratio:	0.83409 real, 1.02941 virtual	salted-sha1-opencl:Many salts
> 
> This one doesn't auto-tune.  Gives same speeds to me.
> 
> > Ratio:	0.85046 real, 0.85934 virtual	sxc-opencl, StarOffice .sxc:Raw
> > Ratio:	0.90867 real, 0.78394 virtual	blockchain-opencl, blockchain My Wallet:Raw
> 
> These two don't auto-tune.

I was wrong: they do.

> They use OpenMP.  With
> GOMP_CPU_AFFINITY=0-31, they give stable same speeds to me, regardless
> of recent changes (obviously).

Running these a few more times, I see that blockchain-opencl gives
unstable speeds.  There's also a weird discrepancy between c/s rates
printed during auto-tuning and the final benchmark:

Calculating best local worksize (LWS)
Testing LWS=64 GWS=524288 ... 84.860ms+
Testing LWS=128 GWS=524288 ... 84.612ms
Testing LWS=192 GWS=524160 ... 108.859ms
Testing LWS=256 GWS=524288 ... 84.863ms
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:      2048     4410969 c/s     4410969 rounds/s 464.297us per crypt_all()!
gws:      4096     6655747 c/s     6655747 rounds/s 615.408us per crypt_all()+
gws:      8192    15338669 c/s    15338669 rounds/s 534.075us per crypt_all()!
gws:     16384     1634042 c/s     1634042 rounds/s  10.026ms per crypt_all()
gws:     32768    20423261 c/s    20423261 rounds/s   1.604ms per crypt_all()+
gws:     65536    21218723 c/s    21218723 rounds/s   3.088ms per crypt_all()+
gws:    131072    21881532 c/s    21881532 rounds/s   5.990ms per crypt_all()+
gws:    262144    22232899 c/s    22232899 rounds/s  11.790ms per crypt_all()+
gws:    524288    22359039 c/s    22359039 rounds/s  23.448ms per crypt_all()
gws:   1048576    22350637 c/s    22350637 rounds/s  46.914ms per crypt_all()
gws:   2097152    22066990 c/s    22066990 rounds/s  95.035ms per crypt_all()
gws:   4194304    22258786 c/s    22258786 rounds/s 188.433ms per crypt_all()
gws:   8388608    22233414 c/s    22233414 rounds/s 377.297ms per crypt_all()
gws:  16777216    21051147 c/s    21051147 rounds/s 796.973ms per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 262144
DONE
Raw:    6488K c/s real, 204736 c/s virtual

The auto-tuned GWS usually varies between 262144, 524288, and 1048576,
and the final speeds vary from ~5000K to ~6500K c/s even for the same GWS.

Why is there a discrepancy between ~22M c/s during auto-tuning and ~6M c/s
in the final benchmark?  Is this how a split kernel should manifest itself
here?  I think not.
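
To put numbers on that discrepancy, here is a minimal Python sketch (not
part of John the Ripper; the helper name parse() and the regex are my own
assumptions about the log layout shown above).  It checks that each
auto-tune line is self-consistent, i.e. that c/s is roughly GWS divided by
the time per crypt_all(), and then compares the rate at the chosen GWS with
the final "Raw" figure.  It says nothing about why the final rate is lower.

```python
import re

# Two lines copied from the auto-tune output above, plus the final rate.
LOG = """\
gws:    262144    22232899 c/s    22232899 rounds/s  11.790ms per crypt_all()+
gws:    524288    22359039 c/s    22359039 rounds/s  23.448ms per crypt_all()
"""
FINAL_CPS = 6488e3  # "Raw: 6488K c/s real"

UNIT = {"us": 1e-6, "ms": 1e-3, "s": 1.0}
LINE = re.compile(
    r"gws:\s+(\d+)\s+(\d+) c/s\s+\d+ rounds/s\s+([\d.]+)(us|ms|s) per crypt_all"
)

def parse(log):
    """Return (gws, reported c/s, seconds per crypt_all()) for each line."""
    rows = []
    for m in LINE.finditer(log):
        rows.append((int(m[1]), int(m[2]), float(m[3]) * UNIT[m[4]]))
    return rows

rows = parse(LOG)
for gws, cps, secs in rows:
    implied = gws / secs  # rate implied by the printed duration
    print(f"GWS {gws}: reported {cps} c/s, implied {implied:.0f} c/s")

# Rate seen during auto-tuning at the GWS that was finally selected.
tuned_cps = {gws: cps for gws, cps, _ in rows}[262144]
ratio = tuned_cps / FINAL_CPS
print(f"auto-tune / final = {ratio:.2f}x")
```

The auto-tune lines agree with their own timings to well under 1%, so the
gap of roughly 3.4x is between the tuning pass and the final benchmark, not
a rounding or printing artifact within the tuning output.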

Looks like an issue unrelated to recent changes.

Alexander
