Date: Thu, 18 Apr 2013 21:25:10 -0400
From:  <jfoug@....net>
To: john-dev@...ts.openwall.com
Subject: Re: Got all dyna formats (except $1$ and $apr1$) working
 with OMP


---- magnum <john.magnum@...hmail.com> wrote: 
> On 18 Apr, 2013, at 1:51 , jfoug <jfoug@....net> wrote:
> > I have posted a 6000+ line patch to magnum.  I was able to get gains better than what I hoped for dyna.  I have almost all issues worked out.
> 
> Latest code in relbench (dynamic only). This is non-OMP build vs. 4xOMP, real i7 cores:
> non-OMP build vs. OMP build running on 1 core:
> 
> Number of benchmarks:		94
> Minimum:			0.44944 real, 0.44944 virtual
> Maximum:			1.04394 real, 1.02326 virtual
> Median:				0.93713 real, 0.93713 virtual
> Median absolute deviation:	0.06278 real, 0.06210 virtual
> Geometric mean:			0.86919 real, 0.86863 virtual
> Geometric standard deviation:	1.23741 real, 1.23690 virtual
> 
> This is bad and maybe we can make it better. Ideally, an OMP build running on one core should be on par with a non-OMP build.

Not in the way dyna OMP is put together.  

Non-OMP simply walks the array of primitive dyna functions.  Each step is run one time, and does THAT step for all candidates.  So, if there are 4 steps in the script, there are 4 function calls.  That is 4 'primitive' calls for only 120 candidates (SSE para-3), which is 0.0333 primitive calls per candidate (in this example).
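Roughly, in pseudo-C, the non-OMP walk looks something like this (step names made up, not the real dynamic.c symbols; just a sketch of the shape):

#include <stdio.h>

#define NUM_CANDIDATES 120            /* SSE para-3 example from above */

typedef void (*dyna_primitive)(void); /* non-OMP primitives take no args */

/* hypothetical primitives standing in for real script steps */
static void step_clean(void)        { /* ... touch all 120 candidates ... */ }
static void step_append_salt(void)  { /* ... */ }
static void step_crypt_md5(void)    { /* ... */ }
static void step_move_to_out(void)  { /* ... */ }

static dyna_primitive script[] = {
    step_clean, step_append_salt, step_crypt_md5, step_move_to_out
};

int main(void)
{
    unsigned i, steps = sizeof(script) / sizeof(script[0]);

    /* 4 steps -> 4 calls total, i.e. 4/120 = 0.0333 calls per candidate */
    for (i = 0; i < steps; i++)
        script[i]();
    printf("ran %u primitive calls for %d candidates\n", steps, NUM_CANDIDATES);
    return 0;
}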

In OMP mode, the layout of the primitive dyna functions has been changed.  Instead of void func(void), they are now void func(int start, int stop).  Also, there are 48x more candidates (the OMP scale factor).  The way OMP was done in dyna, each thread is given a range of candidates, runs all 4 steps back to back, BUT only performs the steps over its own partial range.  So, for 1x OMP on an SSE para-3 build, there will be 4*48*120/12 = 1920 primitive calls to do 5760 candidates, which comes to 0.33333 primitive calls per candidate (i.e. 10x more).  Also, the primitives now take 2 params, where before they took none.
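Here is a rough, self-contained sketch of that layout (again, made-up names; the real code is different), just to show the (start, stop) shape and why the call count balloons even with one thread:

#include <stdio.h>

#define MMX_COEF        4
#define SSE_PARA        3
#define SIMD_BLOCK      (MMX_COEF * SSE_PARA)   /* 12 candidates per slice   */
#define OMP_SCALE       48
#define NUM_CANDIDATES  (OMP_SCALE * 120)       /* 5760 in the example above */

typedef void (*dyna_primitive)(int start, int stop);

static void step_clean(int start, int stop)       { (void)start; (void)stop; }
static void step_append_salt(int start, int stop) { (void)start; (void)stop; }
static void step_crypt_md5(int start, int stop)   { (void)start; (void)stop; }
static void step_move_to_out(int start, int stop) { (void)start; (void)stop; }

static dyna_primitive script[] = {
    step_clean, step_append_salt, step_crypt_md5, step_move_to_out
};
#define NUM_STEPS (sizeof(script) / sizeof(script[0]))

int main(void)
{
    long calls = 0;
    int i;

    /* each iteration covers one 12-wide SIMD slice and runs ALL steps on it,
     * so the call count is NUM_STEPS * (5760 / 12) = 1920 even with 1 thread */
    #pragma omp parallel for reduction(+:calls)
    for (i = 0; i < NUM_CANDIDATES; i += SIMD_BLOCK) {
        unsigned s;
        for (s = 0; s < NUM_STEPS; s++) {
            script[s](i, i + SIMD_BLOCK);
            calls++;
        }
    }
    printf("%ld primitive calls for %d candidates\n", calls, NUM_CANDIDATES);
    return 0;
}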

That is likely not the only overhead.  We now do small parts of the data work at a time, where before the data was manipulated in a tight loop, all doing the same thing, until it was done.  I do not know how much of a difference this makes, and have no real way to time it, but I do bet it makes SOME difference.

Now, for the good news.  Dynamic is a VERY complicated format.  There is a LOT of overhead in the format, copying data in and out and reformatting it.  This is why I was concerned that OMP would give little to no gains.  But the way I put this all together, all of the work done in crypt_all (running the script) is done inside OMP.  There is no switching in/out, even if there are 6 data-movement steps and a couple of crypt steps: each candidate range is handled start to finish by a single thread.

Now, it is possible that I could get 1x OMP closer by computing the 'best' candidate-splitting increment, so that only OMP-num-threads loop iterations are done within the OMP loop.  This has to be done carefully, since the split must happen on an OMP para boundary, or just mmx_coef if non-para, or on an even boundary if MD5_X2.  Done like this, the above example would split 5760 candidates into 4 sets of 1440, and each thread would then work on just those candidates; all threads should finish close to each other (doing the same amount of work).  1440 is divisible by 12, so para-3 works fine.  I will have to see just what difference this makes, but I bet it will help overall, and should bring a 1x OMP build almost even with the non-OMP build: there would only be the main thread, and it would loop once.
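Something like this is what I have in mind for computing the increment (illustrative only; the real boundary logic also has to handle plain mmx_coef and MD5_X2 builds):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define SIMD_BLOCK     12      /* MMX_COEF * SSE_PARA in this example */
#define NUM_CANDIDATES 5760

/* round the per-thread share up to a multiple of the split boundary */
static int best_increment(int total, int threads, int boundary)
{
    int share = (total + threads - 1) / threads;
    return ((share + boundary - 1) / boundary) * boundary;
}

int main(void)
{
    int nthreads = 1;
    int inc, i;

#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    inc = best_increment(NUM_CANDIDATES, nthreads, SIMD_BLOCK);
    printf("%d threads -> chunk of %d candidates each\n", nthreads, inc);

    /* with 5760 candidates and 4 threads this is 4 chunks of 1440
     * (1440 % 12 == 0); with 1 thread it is a single chunk of 5760,
     * so each thread loops just once, much like the non-OMP walk */
    #pragma omp parallel for
    for (i = 0; i < NUM_CANDIDATES; i += inc) {
        int stop = i + inc > NUM_CANDIDATES ? NUM_CANDIDATES : i + inc;
        /* run the whole script over [i, stop) here */
        (void)stop;
    }
    return 0;
}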

Let me give this a shot; it should be easy to do.  There may be 'some' formats which still take a performance hit.  The way a few globals were handled is that I allocate an array of these globals large enough for the number of threads.  Then, when accessing them, I call omp_get_thread_num().  I am not sure of the overhead of that function, but I did not want to build an array of 5760 of these, because these vars get reset each crypt_all.  Resetting an array of 4 is no overhead; resetting 5760 is huge, especially if the format does not use them.  The one that gets the most use is the var telling the string copying to convert to UTF-8 (also upcase and lowcase, but those are not inline in the string-movement functions).
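Roughly, the per-thread globals look like this (illustrative struct and names, not the actual globals in the format):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* hypothetical per-thread state, e.g. the "convert to UTF-8 / upcase" flags */
struct dyna_thread_state {
    int to_utf8;
    int upcase;
};

static struct dyna_thread_state *state;
static int max_threads = 1;

static void init_states(void)
{
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    state = calloc(max_threads, sizeof(*state));
}

static void crypt_all_sketch(void)
{
    /* resetting max_threads entries each crypt_all is cheap; resetting one
     * per candidate (5760 of them) would not be */
    memset(state, 0, max_threads * sizeof(*state));

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* each thread reads/writes only its own slot */
        state[tid].to_utf8 = 1;
    }
}

int main(void)
{
    init_states();
    crypt_all_sketch();
    printf("initialized %d per-thread state slots\n", max_threads);
    free(state);
    return 0;
}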

NOTE: the OMP code is very new.  Many implementation/design items may change as we find bottlenecks.

Jim.
