Date: Fri, 19 Apr 2013 0:19:10 -0400
From:  <jfoug@....net>
To: john-dev@...ts.openwall.com
Subject: Re: Got all dyna formats (except $1$ and $apr1$) working
 with OMP

---- jfoug@....net wrote: 
> ---- magnum <john.magnum@...hmail.com> wrote: 
> This made 1 thread OMP work almost same speed as non-OMP, for 'some' dynas.  However, in others, things were bad.  60%, 50% and even some slower than that (40% or so).
> 
> I THINK this is due to unicode checking, calling omp_get_thread_num() within many of the string functions.

I am pretty sure that the thread-safe unicode data was the bottleneck.  There may be others still lurking; I will check.

Here is the new call within the OMP for loop:

(*(curdat.dynamic_FUNCTIONS[i]))(j,top,omp_get_thread_num());

The 3rd param was added (to all primitives and some helper functions).  I WILL need to do some #define magic for non-OMP builds, for some of the non-primitive helper functions (like the unicode getter and setter), but all in all, it should be pretty trivial.
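
To make the idea concrete, here is a rough sketch of how the thread id could be handed down as a 3rd param, with a macro that collapses to slot 0 in non-OMP builds. This is NOT the actual dynamic format code: THREAD_ID(), MAX_THREADS, utf16_mode, the getter/setter names, primitive_fn, run_primitives() and the two sample primitives are all made up for illustration, and the real #define magic may well look different.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#define THREAD_ID()  omp_get_thread_num()
#define MAX_THREADS  256            /* hypothetical upper bound */
#else
#define THREAD_ID()  0              /* non-OMP build: always slot 0 */
#define MAX_THREADS  1
#endif

/* Hypothetical per-thread unicode flag, one slot per thread. */
static int utf16_mode[MAX_THREADS];

/* Getter/setter receive the thread id instead of looking it up. */
static inline int get_utf16_mode(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    return utf16_mode[tid];
}

static inline void set_utf16_mode(int tid, int on)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    utf16_mode[tid] = on;
}

/* A primitive works on candidates [first, last) for thread tid. */
typedef void (*primitive_fn)(int first, int last, int tid);

/* Run every primitive over every chunk; OMP splits the chunks. */
static void run_primitives(const primitive_fn *fns, size_t nfns,
                           int count, int chunk)
{
    int j;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (j = 0; j < count; j += chunk) {
        int top = (j + chunk < count) ? j + chunk : count;
        size_t i;
        for (i = 0; i < nfns; i++)
            fns[i](j, top, THREAD_ID());   /* id looked up per call */
    }
}

/* Two sample primitives: enable the flag, then count flagged items. */
static int hits[16];

static void prim_enable(int first, int last, int tid)
{
    (void)first; (void)last;
    set_utf16_mode(tid, 1);
}

static void prim_count(int first, int last, int tid)
{
    int k;
    for (k = first; k < last; k++)
        hits[k] += get_utf16_mode(tid);
}

/* Process 10 candidates in chunks of 4 and total the hits. */
static int demo_total(void)
{
    const primitive_fn fns[] = { prim_enable, prim_count };
    int k, total = 0;
    for (k = 0; k < 16; k++)
        hits[k] = 0;
    run_primitives(fns, 2, 10, 4);
    for (k = 0; k < 10; k++)
        total += hits[k];
    return total;
}
```

Built with or without -fopenmp, demo_total() should see every candidate exactly once; the only difference is whether THREAD_ID() expands to a real lookup or to a constant 0.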

Here are timings of dyna0 and dyna1:

*** Non OMP:

Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 10x4x3]... DONE
Raw:    27730K c/s real, 27764K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 10x4x3]... DONE
Many salts:     16422K c/s real, 16394K c/s virtual
Only one salt:  12244K c/s real, 12259K c/s virtual

*** OMP, 1 thread, thread id passed as 3rd param

Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 480x4x3]... DONE
Raw:    26282K c/s real, 26285K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 480x4x3]... DONE
Many salts:     14510K c/s real, 14499K c/s virtual
Only one salt:  11237K c/s real, 11241K c/s virtual

*** OMP, 1 thread, thread id computed within the unicode getter/setter

Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 480x4x3]... DONE
Raw:    26135K c/s real, 26125K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 480x4x3]... DONE
Many salts:     6952K c/s real, 6951K c/s virtual
Only one salt:  6066K c/s real, 6064K c/s virtual

In the 3rd param method, we are calling omp_get_thread_num() 4 times for every 5760 candidates.  For the one where the omp_get_thread_num() call was in the unicode getter/setter, omp_get_thread_num() was being called at least 11520 times for every 5760 candidates!  That could have been GREATLY reduced by hand (basically loop-invariant code motion).  But with the newest method (thread id passed as the 3rd param), the getter/setter is simply an inline function indexing into an array, so a smart compiler will actually do the loop-invariant motion for us.

Thanks for pointing out the problem.  I may be able to use these hints to reduce other overhead.
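
The call counts above can be reproduced with a toy model. This is only a sketch under assumptions: lookup_thread_id() is a made-up stand-in for omp_get_thread_num() that counts its own calls, the flag array and accessor names are hypothetical, and it models exactly one getter call plus one setter call per candidate (the real code did at least that many).

```c
#include <stddef.h>

#define CANDIDATES 5760   /* 480 x 4 x 3, as in the OMP benchmark lines */

static unsigned long lookups;   /* number of thread id lookups so far */
static int flag[1];             /* per-thread flag; one thread here */

/* Stand-in for omp_get_thread_num() that counts its calls. */
static int lookup_thread_id(void)
{
    lookups++;
    return 0;
}

/* Old style: getter and setter each look the id up on every call. */
static int get_flag_slow(void)
{
    return flag[lookup_thread_id()];
}

static void set_flag_slow(int v)
{
    flag[lookup_thread_id()] = v;
}

/* New style: the id is passed in, so these are plain array indexing. */
static inline int get_flag(int tid)
{
    return flag[tid];
}

static inline void set_flag(int tid, int v)
{
    flag[tid] = v;
}

/* One primitive pass with the lookup buried in the accessors. */
static unsigned long count_lookups_slow(void)
{
    int k;
    lookups = 0;
    for (k = 0; k < CANDIDATES; k++)
        set_flag_slow(!get_flag_slow());
    return lookups;
}

/* Same pass, with the id looked up once by the caller. */
static unsigned long count_lookups_hoisted(void)
{
    int k, tid;
    lookups = 0;
    tid = lookup_thread_id();           /* hoisted out of the loop */
    for (k = 0; k < CANDIDATES; k++)
        set_flag(tid, !get_flag(tid));
    return lookups;
}
```

In this model the slow pass does 2 lookups per candidate (11520 in total) while the hoisted pass does 1 per primitive call, which lines up with the 4 lookups per 5760 candidates seen when 4 primitive calls are made per batch.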

Jim.
