john-users - Re: MPI Benchmark

Message-ID: <20181206030857.GB23388@openwall.com>
Date: Thu, 6 Dec 2018 04:08:57 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: MPI Benchmark

Hi Nicholas,

On Mon, Dec 03, 2018 at 04:16:47PM -0600, Nicholas McCollum wrote:
> I noticed on the benchmarking page, the quote, "That said, *if in doubt
> whether your benchmark results are of value, please do submit them*,"...

Right.  Thank you for posting these results.

> I have a small cluster sitting idle at the moment, so I thought I would run
> JtR across 792 cores of Skylake-SP.  I figured since the largest on the
> list was 384 cores of Nehalem series, it might be interesting.  I also have
> a 192 core (8x Xeon Platinum 8160, 6TB DDR4) OpenMP machine available for
> benchmarking if anyone is interested.

Yes, that would be interesting to see, too.

> I downloaded the github bleeding version and compiled JtR with OpenMPI
> 3.1.1 and GCC 8.2.0 with MPI support and verified that it did compile with
> AVX512 instructions.  Nodes are running CentOS 7.5.

It appears that you either have Hyperthreading disabled or at least
don't run enough processes to use it.  I'd also be interested in seeing
results with Hyperthreading in use, so 1584 on your MPI cluster and 384
on your OpenMP machine.

> I thought I would submit the results to the community.  I'm sure that this
> could be improved somewhat, and I am open to recompiling or tweaking if
> anyone is interested.

I'd start by comparing these against a single-core run and a single-node
run.  We need to see what scales and what does not.  You can use the
relbench script to compare benchmark outputs.
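
As a rough illustration of the comparison suggested above (this is not
relbench, only a minimal sketch), the "c/s real" figures can be pulled out
of two benchmark outputs and turned into a scaling efficiency.  The function
names are hypothetical, and the sketch assumes the K and M suffixes mean
thousands and millions of candidates per second.

```python
import re

# Assumption: "K"/"M"/"G" suffixes are decimal multipliers.
SCALE = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}

# Matches e.g. "Many salts:     12458M c/s real, 12583M c/s virtual",
# with or without a leading "> " quote marker.
LINE_RE = re.compile(
    r"(?P<label>[A-Za-z][A-Za-z ]*):\s+"
    r"(?P<num>\d+(?:\.\d+)?)(?P<suffix>[KMG]?) c/s real"
)

def parse_real_speeds(text):
    """Return {label: candidates per second} for every 'c/s real' line."""
    speeds = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            value = float(m.group("num")) * SCALE[m.group("suffix")]
            speeds[m.group("label").strip()] = value
    return speeds

def scaling_efficiency(cluster_cps, single_cps, processes):
    """1.0 means perfect linear scaling from one process to `processes`."""
    return cluster_cps / (single_cps * processes)

# The descrypt lines from the 792-process run quoted in this message.
sample = """\
Many salts:     12458M c/s real, 12583M c/s virtual
Only one salt:  7761M c/s real, 7839M c/s virtual
"""
cluster = parse_real_speeds(sample)
print(cluster)
```

Feeding the output of a single-process run through parse_real_speeds() and
passing both values to scaling_efficiency() would show which formats stay
near 1.0 and which fall well below it.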

> This is 22 nodes of dual Xeon Gold 6150's with 12x 8GB DDR4 2666Mhz.
> 
> MPI in use, disabling OMP (see doc/README.mpi)
> Node numbers 1-792 of 792 (MPI)
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX-512]...
> (792xMPI) DONE
> Many salts:     12458M c/s real, 12583M c/s virtual
> Only one salt:  7761M c/s real, 7839M c/s virtual

This is pretty good speed.  12458/792 = 15.73M c/s per core, which is
better than we see e.g. on i7-4770K AVX2 cores (IIRC, 11M to 12M).
AVX-512 should have helped, but the lower clock rate hurt.  Overall,
this may very well be close to optimal (perhaps enabling/using HT will
provide a few percent speedup here).
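
The per-core figure above is plain division; a minimal sketch, using only
the numbers quoted in this message (the helper name is hypothetical):

```python
def per_core_speed(total_cps, cores):
    """Average candidates per second contributed by each core."""
    return total_cps / cores

# descrypt, "Many salts", real c/s across 792 MPI processes.
descrypt_many_salts = 12458e6
cores = 792

per_core = per_core_speed(descrypt_many_salts, cores)
print(f"{per_core / 1e6:.2f}M c/s per core")  # prints "15.73M c/s per core"
```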

It's roughly on par with 10 modern GPUs.

> Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES
> 512/512 AVX-512]... (792xMPI) DONE
> Speed for cost 1 (iteration count) of 725
> Many salts:     401779K c/s real, 405837K c/s virtual
> Only one salt:  321118K c/s real, 324362K c/s virtual

This is consistent with the above (at 725 iterations vs. descrypt's 25,
it is expected to be ~29 times lower).
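
The factor of ~29 follows from the iteration counts: traditional DES-based
crypt(3) uses 25 iterations, and this bsdicrypt benchmark uses 725.  A
minimal sketch of the check, using the "Many salts" figures quoted above
(variable names are illustrative only):

```python
DESCRYPT_ITERATIONS = 25    # traditional DES-based crypt(3)
BSDICRYPT_ITERATIONS = 725  # from the benchmark line above

factor = BSDICRYPT_ITERATIONS / DESCRYPT_ITERATIONS  # 29.0

descrypt_cps = 12458e6          # descrypt, many salts, real
bsdicrypt_observed = 401779e3   # bsdicrypt, many salts, real

# Speed predicted purely from the iteration count ratio.
bsdicrypt_predicted = descrypt_cps / factor
ratio = bsdicrypt_observed / bsdicrypt_predicted

print(f"expected slowdown: {factor:.0f}x")
print(f"predicted: {bsdicrypt_predicted / 1e6:.1f}M c/s")
print(f"observed / predicted: {ratio:.2f}")
```

The observed speed comes out within roughly 10% of the prediction, which is
what "consistent" means here.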

> Benchmarking: md5crypt, crypt(3) $1$ [MD5 512/512 AVX512BW 16x3]...
> (792xMPI) DONE
> Raw:    7277K c/s real, 7277K c/s virtual

This is ~15 times lower than expected for your hardware, and is less
than one modern GPU.

> Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]...
> (792xMPI) DONE
> Speed for cost 1 (iteration count) of 32
> Raw:    186748 c/s real, 186748 c/s virtual

This is ~5 times lower than expected for your hardware, yet corresponds
to several modern GPUs... or less than two ZTEX 1.15y FPGA boards.
Also, this is tuned for use of HT, but you have it disabled - this alone
costs about 1/3 of the performance in this test.
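
For scale, the approximate shortfall factors given above (~15x for md5crypt,
~5x for bcrypt) can be turned into implied target speeds.  This is only a
back-of-the-envelope sketch: the factors are rough estimates, so the results
are rough too, and the names below are illustrative.

```python
CORES = 792

# (observed real c/s, approximate shortfall factor stated above)
results = {
    "md5crypt": (7277e3, 15),
    "bcrypt": (186748, 5),
}

implied = {}
for name, (observed_cps, shortfall) in results.items():
    expected_cps = observed_cps * shortfall
    implied[name] = expected_cps
    print(
        f"{name}: observed {observed_cps / CORES:.0f} c/s per core, "
        f"implied target ~{expected_cps / CORES:.0f} c/s per core "
        f"(~{expected_cps:.3g} c/s total)"
    )
```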

And so on.

I don't know why some hash types scaled well, but others poorly.  It'd
be helpful to figure this out and perhaps fix something.

> All 408 formats passed self-tests!

At least we have this. :-)

Thanks again,

Alexander