john-dev - GTX TITAN (was: new dev box wishes)

Message-ID: <20130915020632.GA18666@openwall.com>
Date: Sun, 15 Sep 2013 06:06:32 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: GTX TITAN (was: new dev box wishes)

On Wed, Jun 26, 2013 at 02:56:44PM +0400, Solar Designer wrote:
> http://www.xbitlabs.com/articles/graphics/display/zotac-geforce-gtx-titan-amp-edition-6gb-sli.html#sect0
> 
> Stock clocks:
> 837 MHz base, 876 MHz boost, 6008 MHz memory
> Vendor o/c:
> 902 MHz base, 954 MHz boost, 6608 MHz memory

I just ran the FlopsCL_src_linux.zip benchmark from
http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/projects.html on our TITAN.
The card is in fact faster than a stock TITAN, exceeding the stock peak
single-precision GFLOPS (should be 4500 GFLOPS for stock, we get 5000+
peak).  However, as I was afraid, its double-precision performance is
currently locked.  Per some forum comments, it can be unlocked to full
speed via nvidia-settings (we need to add this GPU to xorg.conf first
for that).  Unfortunately, as expected, "nvidia-smi --gom=..." refused
to work on this GPU (most nvidia-smi features work on Tesla cards only).

                                [float   ] Time: 0.085911s, 3199.58 GFLOP/s
                                [float2  ] Time: 0.156523s, 3512.29 GFLOP/s
                                [float4  ] Time: 0.219421s, 5010.96 GFLOP/s
                                [float8  ] Time: 0.472510s, 4653.92 GFLOP/s
                                [float16 ] Time: 0.885512s, 4966.67 GFLOP/s
                                [double  ] Time: 1.176065s, 233.73 GFLOP/s
                                [double2 ] Time: 2.352377s, 233.70 GFLOP/s
                                [double4 ] Time: 4.700395s, 233.92 GFLOP/s
                                [double8 ] Time: 9.401957s, 233.89 GFLOP/s
ERROR: clEnqueueNDRangeKernel failed, cl_out_of_resources
                                [double16] Time: 0.016776s, 262160.00 GFLOP/s

I don't know why double16 fails (the GFLOP/s figure printed for it is
bogus, since the kernel launch failed).

FlopsCUDA_src_linux.zip, after teaching it about compute capability 3.5
corresponding to 192 SPs/MP (added one line to the table), gives:

----- Standard benchmark, sequential instructions are dependent -------------

        [Device  0,    GeForce GTX TITAN] Time: 0.042523 (s), Total FLOPs : 134217728000
        [Device  0,    GeForce GTX TITAN] Peak GFLOP/s: 5128.70, Actual GFLOP/s: 3156.4, 61.543% efficiency

----- Instruction-level parallelism (ILP): multiple independent instructions (i.e. used by Kepler's warp scheduler) ----

        [Device  0,    GeForce GTX TITAN] (ILP) Time: 0.122247 (s), Total FLOPs : 536870912000
        [Device  0,    GeForce GTX TITAN] (ILP) Peak GFLOP/s: 5128.70, Actual GFLOP/s: 4391.7, 85.630% efficiency
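
As a sanity check, the "Actual GFLOP/s" and efficiency figures above can
be reproduced from the printed time and FLOP totals.  This is only a
sketch: the helper name gflops() is made up for illustration, and the
peak value is simply copied from the benchmark's own output.

```python
def gflops(total_flops, seconds):
    """Achieved throughput in GFLOP/s (1 GFLOP = 1e9 floating-point ops)."""
    return total_flops / seconds / 1e9

# "Peak GFLOP/s" as printed by the benchmark, not derived here.
PEAK = 5128.70

# (Total FLOPs, Time in seconds) copied from the two runs above.
runs = {
    "dependent": (134217728000, 0.042523),
    "ILP": (536870912000, 0.122247),
}

for name, (flops, secs) in runs.items():
    actual = gflops(flops, secs)
    print(f"{name}: {actual:.1f} GFLOP/s, {100 * actual / PEAK:.3f}% of peak")
```

The ILP run does 4x the work in under 3x the time, which is where its
higher efficiency comes from.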

It calculates the "Peak GFLOP/s" figure by querying the device for its
MP count, compute capability, and max boost clock rate (it gets 954 MHz
here).
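
For reference, here is a minimal sketch of how such a peak figure can be
derived.  It assumes 14 MPs (the GTX TITAN's MP count, which is not
stated in this message) and 2 FLOPs per SP per clock (one fused
multiply-add).  Whether the benchmark uses exactly this formula is an
assumption too, although it does reproduce the number printed above.
The function name is hypothetical.

```python
def peak_gflops(mp_count, sps_per_mp, clock_mhz, flops_per_sp_per_clock=2):
    """Theoretical single-precision peak in GFLOP/s.

    flops_per_sp_per_clock=2 assumes one fused multiply-add per clock.
    """
    return mp_count * sps_per_mp * flops_per_sp_per_clock * clock_mhz / 1000.0

MP_COUNT = 14      # assumption: GTX TITAN MP count, not given in this message
SPS_PER_MP = 192   # compute capability 3.5, per the table entry added above

# Vendor o/c boost clock, as the benchmark reads it from the device.
print(f"954 MHz: {peak_gflops(MP_COUNT, SPS_PER_MP, 954):.2f} GFLOP/s")
# Stock base clock, for comparison with the 4500 GFLOPS stock figure.
print(f"837 MHz: {peak_gflops(MP_COUNT, SPS_PER_MP, 837):.2f} GFLOP/s")
```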

I guess these results may teach us something about optimization for this
GPU (and other Kepler GPUs?): four-element vectors and/or interleaving
of independent instructions give the best results.

As to double-precision performance, indeed it does not matter for JtR
(at least currently), yet it may be relevant if we let other projects
use our dev boxes as well.

Alexander