john-dev - Re: Got all dyna formats (except $1$ and $apr1$) working with OMP

Message-ID: <20130419001910.J0YDS.37573.imail@eastrmwml206>
Date: Fri, 19 Apr 2013 0:19:10 -0400
From:  <jfoug@....net>
To: john-dev@...ts.openwall.com
Subject: Re: Got all dyna formats (except $1$ and $apr1$) working
 with OMP

---- jfoug@....net wrote: 
> ---- magnum <john.magnum@...hmail.com> wrote: 
> This made 1 thread OMP work almost same speed as non-OMP, for 'some' dynas.  However, in others, things were bad.  60%, 50% and even some slower than that (40% or so).
> 
> I THINK this is due to unicode checking, calling omp_get_thread_num() within many of the string functions.

I am pretty sure the thread-safe unicode data was the bottleneck.  There may be others still lurking; I will check.

Here is the new call within the OMP for loop:

(*(curdat.dynamic_FUNCTIONS[i]))(j,top,omp_get_thread_num());

The 3rd param was added (to all primitives and some helper functions).  I WILL need to do some #define magic for non-OMP builds, for some of the non-primitive helper functions (like the unicode getter and setter), but all in all, it should be pretty trivial.
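For illustration only, here is a minimal sketch of the idea, not the actual dynamic format code. Every name in it (THREAD_ID, MAX_THREADS, primitive_fn, set_unicode, get_unicode, prim_enable_unicode, prim_count, run_primitives) is hypothetical. It assumes the primitives take (first, last, thread id) as in the call above, and it shows one possible form of the "#define magic": the thread id collapses to a constant 0 when the build has no OpenMP.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#define THREAD_ID() omp_get_thread_num()
#else
#define THREAD_ID() 0            /* non-OMP build: there is only one thread */
#endif

#define MAX_THREADS 1024         /* hypothetical upper bound on thread count */

/* Hypothetical per-thread state standing in for the thread-safe unicode data. */
static int unicode_flag[MAX_THREADS];
static int processed[MAX_THREADS];

/* Getter/setter take the thread id as a parameter: just an array index. */
static inline void set_unicode(int tid, int on)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    unicode_flag[tid] = on;
}

static inline int get_unicode(int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    return unicode_flag[tid];
}

/* Every primitive gets the thread id as its 3rd parameter. */
typedef void (*primitive_fn)(int first, int last, int tid);

static void prim_enable_unicode(int first, int last, int tid)
{
    (void)first; (void)last;
    set_unicode(tid, 1);
}

static void prim_count(int first, int last, int tid)
{
    if (get_unicode(tid))
        processed[tid] += last - first;
}

/* Run each primitive over chunks of candidates; the thread id is looked
 * up once per primitive call, never inside the helpers. */
static void run_primitives(const primitive_fn *fns, size_t nfns,
                           int count, int chunk)
{
    int j;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (j = 0; j < count; j += chunk) {
        int top = (j + chunk < count) ? j + chunk : count;
        size_t i;
        for (i = 0; i < nfns; i++)
            fns[i](j, top, THREAD_ID());
    }
}
```

Calling run_primitives() with a two-entry table { prim_enable_unicode, prim_count }, 5760 candidates and a chunk of 480 should leave the processed[] counters summing to 5760, with or without OpenMP enabled.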

Here are timings of dyna0 and dyna1.

*** Non OMP:

Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 10x4x3]... DONE
Raw:    27730K c/s real, 27764K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 10x4x3]... DONE
Many salts:     16422K c/s real, 16394K c/s virtual
Only one salt:  12244K c/s real, 12259K c/s virtual

*** OMP 1x thread id as 3rd param

Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 480x4x3]... DONE
Raw:    26282K c/s real, 26285K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 480x4x3]... DONE
Many salts:     14510K c/s real, 14499K c/s virtual
Only one salt:  11237K c/s real, 11241K c/s virtual

*** OMP 1x thread id being computed within unicode thread getter/setter

Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics 480x4x3]... DONE
Raw:    26135K c/s real, 26125K c/s virtual

Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics 480x4x3]... DONE
Many salts:     6952K c/s real, 6951K c/s virtual
Only one salt:  6066K c/s real, 6064K c/s virtual

In the 3rd param method, we call omp_get_thread_num() 4 times for every 5760 candidates.  In the version where the omp_get_thread_num() call was inside the unicode getter/setter, it was being called at least 11520 times for every 5760 candidates!  That could be GREATLY reduced (basically loop-invariant code motion).  But with the newest method (thread id passed in as a param), the getter/setter is simply an inline function indexing into an array, so a smart compiler will actually do the loop-invariant motion for us.
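To make the call counts above concrete, here is a small sketch that counts lookups in both arrangements. It is an illustration under assumptions, not the real code: fake_thread_num() is a hypothetical counting stand-in for omp_get_thread_num(), all other names are made up, and the "one set plus one get per candidate" pattern is assumed only because it reproduces the 11520 figure. 5760 is the 480x4x3 from the benchmark banner.

```c
#include <assert.h>

enum { CANDIDATES = 480 * 4 * 3 };   /* 5760 candidates per call */

static long id_lookups;              /* how often the thread id was fetched */
static int unicode_flag[1];          /* single slot: the stand-in returns 0 */

/* Hypothetical stand-in for omp_get_thread_num() that counts its calls. */
static int fake_thread_num(void)
{
    id_lookups++;
    return 0;
}

/* Old arrangement: getter and setter each look up the thread id. */
static void set_unicode_slow(int on) { unicode_flag[fake_thread_num()] = on; }
static int  get_unicode_slow(void)   { return unicode_flag[fake_thread_num()]; }

/* New arrangement: thread id is a parameter, so these are plain indexing. */
static void set_unicode_fast(int tid, int on) { unicode_flag[tid] = on; }
static int  get_unicode_fast(int tid)         { return unicode_flag[tid]; }

/* One set and one get per candidate: two lookups per candidate. */
static long lookups_per_item(void)
{
    int i, seen = 0;
    id_lookups = 0;
    for (i = 0; i < CANDIDATES; i++) {
        set_unicode_slow(i & 1);
        seen += get_unicode_slow();
    }
    (void)seen;
    return id_lookups;
}

/* Lookup hoisted out of the inner loop: one per primitive call. */
static long lookups_hoisted(int primitives)
{
    int p, i, seen = 0;
    id_lookups = 0;
    for (p = 0; p < primitives; p++) {
        int tid = fake_thread_num();  /* loop-invariant, fetched once */
        for (i = 0; i < CANDIDATES; i++) {
            set_unicode_fast(tid, i & 1);
            seen += get_unicode_fast(tid);
        }
    }
    (void)seen;
    return id_lookups;
}
```

Under these assumptions lookups_per_item() counts 2 x 5760 lookups, while lookups_hoisted(4) counts one lookup per primitive call.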

Thanks for pointing out the problem.  I may be able to use these hints to reduce other overhead.

Jim.
