Date: Mon, 11 Sep 2017 18:36:21 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Hyperthreading / fork versus mpi / instruction sets?

On Sun, Sep 10, 2017 at 09:40:25PM +0200, spam@...lab.nl wrote:
> I'm using john on one system with an enormous amount of CPUs. The CPUs
> support hyperthreading. I'm trying to figure out what the fastest
> combination of settings is.

Typically it is: hyperthreading enabled and "--fork".

> Hyperthreading:
> As far as I understand this is beneficial only if the cracking code is not
> 100% optimized. So in theory: not useful, each HT thread cannot do anything
> useful since the 'real' core is fully saturated. Practice: run john on HT
> cores as well to optimize CPU utilization. Even if the cracking code is 100%
> optimized it won't harm me (no disadvantages). Two questions: (1) is this
> correct?

This is a start towards an understanding, but it is an over-simplification.
Other factors at play are register pressure (for a given algorithm on a
given instruction set architecture) and the maximum instruction decode and
issue rate from one hardware thread (which may or may not be the same as
what the core is capable of from multiple threads).

Speaking of register pressure, sometimes it isn't possible to optimally
utilize a core from just one thread because the number of registers needed
for frequent storage and retrieval of intermediate results for a
would-be-optimal number of interleaved instances of the algorithm exceeds
what's available in the ISA. Then we have to choose between interleaving
fewer instances, spilling registers to memory (actually L1 cache), or a
mix of both. (Register renaming - that is, transparent use of a larger
number of hardware registers than are exposed via the ISA - helps a bit,
but it can't avoid spills to memory/cache in the ISA-level code.)

Why interleave multiple instances rather than implement just one?
That's to keep the core's execution units busy doing useful work at all
times, including during previously issued instructions' latencies.

With 2 or more hardware threads/core, we effectively get more ISA-level
registers for the interleaving: each hardware thread has its own set of
ISA-level registers. (Yes, they share the same pool of hardware registers,
but it's large enough. It's the ISA registers that are the scarce
resource.) We can then optimally use fewer interleaved instances of the
algorithm per thread, and thus not run into register pressure and not
incur spills.

Sometimes the optimal interleaving factor (in the ISA-level code) varies
depending on whether you intend to run e.g. 1 or 2 threads/core. For
example, our bcrypt code with 2x interleaving is more optimal for Intel
CPUs with HT enabled, and our bcrypt code with 3x interleaving is more
optimal for Intel CPUs with HT disabled. Since both kinds of CPUs/settings
exist, it is tough for us to provide a universally optimal default build
(that would need runtime detection and code dispatching), and currently we
choose 2x interleaving when building for modern x86 CPUs, assuming that it
is typical to have HT enabled and considering that this gives slightly
higher performance in absolute terms. Unfortunately, the performance hit
this decision of ours has on HT-disabled CPUs is pretty bad. (Frank
recently committed a fix so that advanced users can use "make generic" to
have bcrypt's interleaving factor tuned for their CPU at build time. But
that's a special case, and it hurts performance of other JtR formats.) I
use this example here to illustrate that the problem is complicated.

Speaking of maximum instruction decode and issue rates, for
general-purpose x86 CPUs these tend to be achievable even from just one
hardware thread. (Indeed, Intel sells crippled versions of their CPUs that
are not HT-capable, yet with optimal code those can usually be almost as
fast as their HT-capable elder siblings.)
However, this isn't always the case. For Intel's Xeon Phi (at least the
first generation; I am not familiar enough with the second generation
yet), you actually have to run at least 2 threads/core (and up to 4) to
even have a chance of using the cores fully. Similarly, IBM's POWER chips
(and I think Oracle's recent SPARC chips) assume use of multiple hardware
threads/core in order to have a chance of full utilization:

http://www.openwall.com/lists/john-dev/2015/10/19/1

For HPC workloads, it is fairly common to disable HT. That's largely
wrong for those as well, but it's also different. The causes include poor
scalability of the algorithms or their implementations (so that having
e.g. twice as many logical CPUs that combined are just slightly faster
becomes a bad idea) and the algorithms being memory-bound (thus heavily
competing for shared caches and memory, and not benefiting as much from
greater processing power without a corresponding increase in cache sizes
and memory bandwidth). This may be right given fixed code with no intent
to optimize, but it may be wrong to give up on the chance to optimize and
make beneficial use of multiple hardware threads, even for those
workloads. For memory-bound workloads, multiple hardware threads per core
help avoid keeping the core idle while one of the threads is waiting for
memory. Care may need to be taken to minimize competition for caches -
e.g., on one rather crazy occasion I ended up having the application
parse /proc/cpuinfo, assign CPU affinities, and use per-hardware-core
mutexes to avoid having two hardware threads of the same core enter the
same code section at once, where one thread's exclusive use of that
core's portion of L3 cache was more beneficial for overall performance.
I understand that few developers would discover opportunities like this,
go for this sort of effort, and accept the need to reconsider such
decisions after each hardware upgrade, and thus many HPC clusters run
with HT disabled, thereby wasting some of their performance potential.

Luckily, we don't run into this sort of dilemma for most functionality in
JtR.

> And (2) any advice about the number of extra HT processes to
> assign? Use all? Use just say one or two to compensate for a small fraction
> of non-perfect code?

Use all. "Use just say one or two" can't possibly fully compensate for
"non-perfect code" anyway (even if this were the only problem HT was
meant to solve, which isn't the case), because the code will run on each
core. So when you run extra threads on some cores, but not on all at the
same time, you get this compensation on some cores, but not on all.

> Fork vs. MPI:
> I've mentioned that there is a number of hash formats that support MPI and
> that john runs those hash types on MPI by default. Furthermore I've seen
> that forked parallel processing (--fork=n) is possible for all hash types.
> AFAIK, MPI is typically used in network connected multi-system environments.
> Forking is done on one machine. My assumption is that forking is more
> efficient than MPI because of less overhead (= faster). However MPI might
> allow more granular control, rescheduling during the cracking process to get
> maximum efficiency, but *only* useful if MPI latency is extremely low
> compared to the cracking speed. My questions: (1) is this correct?

No. As you've realized, you confused MPI and OpenMP.

> Furthermore: (2) what's the best approach for fast hashes (e.g. raw-md5) and

"--fork", or use "--format=raw-md5-opencl --mask=..." with a GPU.

> (3) what's the best approach for slow hashes (e.g. bcrypt)?

Also "--fork", but the relative slowdown of OpenMP is smaller for slow
hashes than it is for fast hashes. That's in terms of c/s rate.
There are some other factors at play: how optimal the order of candidate
passwords tested is, whether all jobs complete nearly simultaneously or
not, and ease of use. OpenMP may be better at these (albeit not always,
as in some cases it increases the amount of buffering), but usually not
to a sufficient extent to prefer it over "--fork".

The rescheduling you mentioned is typically disabled by default anyway,
and is limited by the program's architecture (frequent sync points). With
libgomp (what gcc uses), you can enable it with an environment variable,
but it might make sense only when attacking slow hashes (or else the
rescheduling is relatively too expensive on its own) on a system that has
other load on it - which is a bad idea anyway. OpenMP is particularly
sensitive to other load. If you have any unrelated load, then either
avoid OpenMP or lower its thread count (leaving some logical CPUs idle
some of the time). So this is actually another reason to prefer "--fork",
where getting out of sync doesn't result in any loss in c/s rate and
doesn't need any rescheduling, but instead may result in the child
processes terminating at different times. Hopefully, such distractions
even out over a long period of time, and don't result in overly different
termination times.

Unfortunately, the current JtR architecture is far from optimal in its
use of thread-level parallelism. Some of the above tradeoffs could be
reduced or avoided with a different architecture.

> Instruction sets:
> I've mentioned that john contains *a lot* of instruction set specific
> optimized code (e.g. SSE4.x/AVXx). Older multi CPU Xeon E5 and E7 systems
> are quite cheap nowadays, and looking at absolute performance (in general)
> they're still extremely fast (still in the list of fastest processors).
> However they lack e.g. AVX2 support. Right now it's very difficult for me to
> figure out what's the best choice, without buying massively expensive new
> CPUs.
> Question: is there a benchmark available of something like the latest
> fancy everything supporting CPU versus different instruction sets builds on
> this system, so one can figure out what the advantage of buying new CPUs is,
> from a john cracking perspective?

We have some benchmarks on the wiki, but we haven't kept them up to date:

http://openwall.info/wiki/john/benchmarks

AVX2 almost doubles the speeds (relative to AVX) for formats where we
have SIMD code. It makes no difference for formats where we don't (e.g.,
it won't help for bcrypt). AVX-512 is a further speedup, but you're
unlikely to find those CPUs cheap yet.

> Any other considerations or wise advice, especially concerning maximizing
> CPU cracking is also more than welcome! Thank you.

For CPUs, consider getting engineering samples of pairs of Xeon E5 v3 or
newer off eBay. Also consider GPUs. With CPUs, you'll get more JtR
formats, and bcrypt will perform better (but lose to FPGAs). With GPUs,
you'll get better speeds per dollar and per Watt for most of the formats
where we have OpenCL code (and you'll also be able to use Hashcat), but
for fast hashes you'll get decent speeds only as long as you use
on-device candidate password generation (for JtR this means at least
partial use of "--mask").

Avoid first-generation Xeon Phi even if cheap.

Alexander