Date: Fri, 19 Apr 2013 19:22:19 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: minor raw-sha1-ng pull request

On 19 Apr, 2013, at 19:12 , magnum <john.magnum@...hmail.com> wrote:
>> $ ../run/john -test -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics
>> 4x]...(8xOMP) DONE
>> Raw: 23232K c/s real, 3338K c/s virtual
>>
>> I don't understand why real is so different than virtual, compared to
>> without omp:
>>
>> $ ../run/john -test -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics 4x]... DONE
>> Raw: 9251K c/s real, 9251K c/s virtual
>>
>> What am I doing wrong? (I already batch crypts, so I figured I could just
>> split the work across threads if available, maybe this was naive).
>
> This is expected. The real figures are hashes/wall-clock-time and the virtual ones are hashes/CPU-time. If you could get it to scale well, the virtual figure would be near a non-OMP one.
>
> So for 8x OMP you only get ~2.5x speed. As long as you don't get lower speeds than for one core, we can commit it for sure. I think you need to run much larger batches under OMP (OMP_SCALE in rawSHA256_ng_fmt.c) for hiding the overhead. I got nt2 to scale fairly well on Intel with an OMP_SCALE of 1536. That is, it runs 1536*MMX_COEF*MD4_PARA crypts per call, per core. Or put another way, the for loop will submit 1536 normal batches to each thread.

OK, I get this for non-OMP build:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... DONE
Raw: 23784K c/s real, 23784K c/s virtual

And this for OMP-build but running 1 core:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... DONE
Raw: 23553K c/s real, 23553K c/s virtual

That's fine. But trying to use more cores does not work well:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... (4xOMP) DONE
Raw: 16872K c/s real, 9373K c/s virtual

I see you already have SHA1_PARALLEL_HASH of 512. Look at init() in raw-sha256-ng and try to mimic that - you probably want to use an OMP_SCALE of 3 and the number of keys would be actual number of cores in use * OMP_SCALE * SHA1_PARALLEL_HASH. I bet this will give much better results. But this means you need to dynamically allocate the buffers.

magnum
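
For illustration only, here is a rough sketch of the sizing and dynamic allocation suggested above. It is not the actual init() from raw-sha256-ng: the struct, the helper names (threads_in_use, key_count_for, alloc_key_buffers, free_key_buffers) and the per-key buffer sizes are all made up for the example. Only SHA1_PARALLEL_HASH = 512 and OMP_SCALE = 3 come from the message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define SHA1_PARALLEL_HASH 512 /* batch size mentioned in the message */
#define OMP_SCALE          3   /* scale factor suggested in the message */
#define KEY_WORDS          4   /* assumption: 16 bytes per key slot (pwlen <= 15) */
#define DIGEST_WORDS       5   /* a SHA-1 digest is 160 bits = 5 x 32-bit words */

/* Hypothetical container for the buffers that can no longer be static arrays. */
struct key_buffers {
    size_t    count;   /* number of keys handled per crypt call */
    uint32_t *keys;    /* count * KEY_WORDS words */
    uint32_t *digests; /* count * DIGEST_WORDS words */
};

/* Number of threads OpenMP would use, or 1 in a non-OMP build. */
static int threads_in_use(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* keys per call = cores in use * OMP_SCALE * SHA1_PARALLEL_HASH */
static size_t key_count_for(int threads)
{
    assert(threads > 0);
    return (size_t)threads * OMP_SCALE * SHA1_PARALLEL_HASH;
}

/* Allocate zeroed buffers sized for the given thread count; 0 on success, -1 on failure. */
static int alloc_key_buffers(struct key_buffers *b, int threads)
{
    b->count   = key_count_for(threads);
    b->keys    = calloc(b->count * KEY_WORDS, sizeof *b->keys);
    b->digests = calloc(b->count * DIGEST_WORDS, sizeof *b->digests);
    if (!b->keys || !b->digests) {
        free(b->keys);
        free(b->digests);
        b->keys = b->digests = NULL;
        b->count = 0;
        return -1;
    }
    return 0;
}

/* Release the buffers and reset the struct. */
static void free_key_buffers(struct key_buffers *b)
{
    free(b->keys);
    free(b->digests);
    b->keys = b->digests = NULL;
    b->count = 0;
}
```

With these numbers each thread gets OMP_SCALE * SHA1_PARALLEL_HASH = 1536 keys per call, so a 4-thread run would be sized for 6144 keys. Whether 3 is the best scale for a given machine would still need benchmarking.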