Date: Fri, 3 Apr 2015 12:59:56 +0300
From: Solar Designer <>
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka, magnum, Frank -

On Fri, Apr 03, 2015 at 04:31:28AM +0200, Agnieszka Bielec wrote:
> 2015-04-02 10:33 GMT+02:00 Agnieszka Bielec <>:
> thanks, but it seems that the init function (where I need access to
> the cost) is executed before the tunable_cost_* functions.
> If this is impossible, I must add my option to JohnTheRipper or make
> other big modifications. Should I do it? I wanted to avoid this,
> so today I was fighting with the format and moved the init function
> to be called inside set_salt:
> "
> if (last_t_cost != t_cost || last_m_cost != m_cost) {
>         if (last_t_cost != 0 && last_m_cost != 0)
>                 done();
>         last_t_cost = t_cost;
>         last_m_cost = m_cost;
>         init(fmt_pomelo_tmp, t_cost, m_cost);
> }
> "
> and many more tricks. Now it works, but I have some bugs, and
> as before it is necessary to make modifications in files
> like loader.c.
> I've added automatic generation of hashes for autotune.
> I know there are some things I must correct, but I want to do the
> most important things first.

magnum, Frank - can you please help Agnieszka with the above?
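
For reference, the re-initialization trick quoted above could be sketched roughly as follows. This is only an illustration, not the code in Agnieszka's repository: alloc_buffers(), free_buffers(), ensure_buffers() and the two counters are hypothetical names standing in for the format's real init()/done(). It tracks "already initialized" with an explicit flag rather than by comparing the saved costs against 0, on the assumption that a cost of 0 may be a valid setting.

```c
/* Hypothetical stand-ins for the format's real allocation and cleanup. */
static int alloc_count, free_count;

static void alloc_buffers(unsigned int t_cost, unsigned int m_cost)
{
    (void)t_cost;
    (void)m_cost;
    alloc_count++;   /* a real format would allocate its buffers here */
}

static void free_buffers(void)
{
    free_count++;    /* a real format would release its buffers here */
}

static int have_buffers;                 /* explicit "initialized" flag */
static unsigned int last_t_cost, last_m_cost;

/* Call from set_salt(): (re)allocate only when the costs change. */
static void ensure_buffers(unsigned int t_cost, unsigned int m_cost)
{
    if (have_buffers && t_cost == last_t_cost && m_cost == last_m_cost)
        return;                          /* same costs, nothing to do */
    if (have_buffers)
        free_buffers();                  /* drop buffers sized for old costs */
    alloc_buffers(t_cost, m_cost);
    last_t_cost = t_cost;
    last_m_cost = m_cost;
    have_buffers = 1;
}
```

With the flag, switching to or from a salt whose costs are 0 still frees the old buffers before allocating new ones.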

> now in my repository M_COST and T_COST are hardcoded
> to value 2
> tests for M_COST=2 and T_COST=2:
> It's better than before, but 2 is a very small value for POMELO

What memory usage (per hash) does it correspond to?
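
A back-of-the-envelope way to estimate it, assuming (as I recall from the POMELO v2 specification, so please double-check) that the state is 2^(13 + m_cost) bytes per hash. The helper names below are made up for illustration:

```c
#include <stdint.h>

/* Assumption: POMELO v2 uses a state of 2^(13 + m_cost) bytes per hash. */
static uint64_t pomelo_state_bytes(unsigned int m_cost)
{
    return (uint64_t)1 << (13 + m_cost);
}

/* Total memory when gws hashes are computed at once, one state each. */
static uint64_t pomelo_total_bytes(unsigned int m_cost, uint64_t gws)
{
    return pomelo_state_bytes(m_cost) * gws;
}
```

Under that assumption, m_cost=2 would mean 32 KiB per hash, so 64 MiB in total at GWS 2048 and 2 GiB at GWS 65536.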

> [ run]$ ./john --format=pomelo-opencl -dev=2 --test
> Device 2: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Local worksize (LWS) 1, global worksize (GWS) 2048
> Benchmarking: pomelo-opencl, POMELO [POMELO OpenCL (inefficient,
> development use only)]... DONE
> Raw:    200704 c/s real, 6432 c/s virtual
> [ run]$ ./john --format=pomelo-opencl -dev=3 --test
> Device 3: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Build log: Compilation started
> Compilation done
> Linking started
> Linking done
> Kernel <pomelo_crypt_kernel> was not vectorized
> Done.
> Local worksize (LWS) 8, global worksize (GWS) 65536
> Benchmarking: pomelo-opencl, POMELO [POMELO OpenCL (inefficient,
> development use only)]... DONE
> Raw:    198593 c/s real, 6403 c/s virtual

OK, now it's the same speed for both OpenCL SDKs.

What speed are you getting with the C + OpenMP code (no OpenCL)?  Please
test with GOMP_CPU_AFFINITY=0-31 and with OMP_NUM_THREADS=16.  These
should be separate tests, so you'd run e.g. these three tests:

../run/john -te -form=pomelo

GOMP_CPU_AFFINITY=0-31 ../run/john -te -form=pomelo

OMP_NUM_THREADS=16 ../run/john -te -form=pomelo

without saving these variables to the default exported environment.

> not vectorized but fast, should I worry about this?

Maybe.  Does your C + OpenMP code use SIMD intrinsics?  It should, when
available.  The POMELO v2 PHC submission readily includes SIMD code for
SSE*/AVX (128-bit) and for AVX2 (256-bit).  On super, you should be able
to build it for 128-bit AVX.  This will give you the performance
baseline, and perhaps it'll achieve much more than the 200k c/s you're
seeing here.  Then you'll need to worry that your OpenCL kernels are
slower, and make them at least as fast.  You might need to use explicit
vector types (perhaps 4x 64-bit) in the OpenCL kernels to achieve this.

> [ run]$ ./john --format=pomelo-opencl -dev=4 --test
> Device 4: Intel(R) Many Integrated Core Acceleration Card

At some point, you'll need to bring your code to build natively for MIC,
and run it on the MIC card (with 240 OpenMP threads).  Since MIC offers
only 512-bit intrinsics, but POMELO v2 has only 256-bit SIMD parallelism,
you'll probably need to choose between using scalar code and using the
512-bit intrinsics with half the vector width wasted.

> ./john --format=pomelo-opencl -dev=1 --test
> Device 1: Tahiti [AMD Radeon HD 7900 Series]
> OpenCL error (CL_OUT_OF_RESOURCES) in file (opencl_pomelo_fmt_plug.c) at
> line (519) - (failed in clEnqueueNDRangeKernel)
> it seems that I can allocate only very small amounts of __private memory
> even if I set MEM_SIZE to 3000, --dev=1 crashes

Perhaps autotuning needs to apply not only to GWS and LWS, but also to
the 1-bit (?) choice of whether you use global or local memory here.
Perhaps what's possible and what's optimal will vary not only across
OpenCL devices, but also with m_cost, and with GWS and LWS being tuned.
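
As a first cut, the "what's possible" part of that choice might look like the sketch below. The enum and choose_mem_kind() are hypothetical names, not existing JtR code. It assumes each work-item needs its own state and that local memory is shared by the whole work-group, so the states of all LWS work-items have to fit in it together.

```c
#include <stddef.h>

enum mem_kind { MEM_LOCAL, MEM_GLOBAL };

/*
 * Hypothetical helper: decide where the per-hash state can live.
 * state_bytes      - state size for one hash (depends on m_cost)
 * lws              - local work size being tried by autotune
 * local_mem_bytes  - local memory available to one work-group
 */
static enum mem_kind choose_mem_kind(size_t state_bytes, size_t lws,
                                     size_t local_mem_bytes)
{
    /* Divide instead of multiplying to avoid overflow in state_bytes * lws. */
    if (lws == 0 || state_bytes > local_mem_bytes / lws)
        return MEM_GLOBAL;
    return MEM_LOCAL;
}
```

In a real format, local_mem_bytes would come from clGetDeviceInfo() with CL_DEVICE_LOCAL_MEM_SIZE. This only tells you whether local memory is possible, not whether it is faster, so autotune would still have to benchmark both variants where both fit.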

Once again, thank you for working on this!

