Date: Fri, 19 Apr 2013 03:40:02 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Got all dyna formats (except $1$ and $apr1$) working with OMP

On 19 Apr, 2013, at 3:25 , <jfoug@....net> wrote:
> ---- magnum <john.magnum@...hmail.com> wrote:
>> On 18 Apr, 2013, at 1:51 , jfoug <jfoug@....net> wrote:
>>> I have posted a 6000+ line patch to magnum. I was able to get gains better than what I hoped for dyna. I have almost all issues worked out.
>>
>> Latest code in relbench (dynamic only). This is non-OMP build vs. 4xOMP, real i7 cores:
>> non-OMP build vs. OMP build running on 1 core:
>>
>> Number of benchmarks: 94
>> Minimum: 0.44944 real, 0.44944 virtual
>> Maximum: 1.04394 real, 1.02326 virtual
>> Median: 0.93713 real, 0.93713 virtual
>> Median absolute deviation: 0.06278 real, 0.06210 virtual
>> Geometric mean: 0.86919 real, 0.86863 virtual
>> Geometric standard deviation: 1.23741 real, 1.23690 virtual
>>
>> This is bad and maybe we can make it better. Ideally, an OMP build running one core should be on par with a non-OMP build.
>
> Not in the way dyna OMP is put together.
>
> nonOMP simply walks the array of primitive Dyna-functions. Each step is run one time, and does THAT step for all candidates. So, if there are 4 steps in the script, there are 4 function calls. Now, that IS 4 'primitive' calls to only 120 (SSE para-3) candidates, which is 0.0333 primitive calls per candidate (in this example).
>
> In OMP mode, the layout of the primitive Dyna-functions has been changed. Instead of void func(void), they have been changed to void func(int start, int stop). Also, the candidates are 48x more (scale). The way OMP was done in dyna is that each thread will be given a range of candidates. So each thread will run all 4 steps back to back, BUT will only perform the steps over the thread's given partial range.
> So, for 1x OMP, if it is an SSE 3x para build, there will be 4*48*120/12, which is 1920 primitive calls, to do 5760 candidates, which comes to 0.33333 primitive calls per candidate (i.e. 10x more). Also, the primitives have 2 params, where before they had none.

This is without looking at any code, but could you not address this simply by bumping BLOCK_LOOPS (or equivalent) by 10x (perhaps 16x) for OMP? So if a non-OMP build does 10 of them, OMP would do 160. Would that mitigate the impact you describe?

magnum