john-dev - Re: PHC: Parallel in OpenCL

Message-ID: <CAKGDhHWdfFhFQmgV2oqiYSJPC__u=q-FPfQ5+S9JT-uSRDR-Kw@mail.gmail.com>
Date: Tue, 2 Jun 2015 20:36:36 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Parallel in OpenCL

2015-06-02 5:37 GMT+02:00 Lukas Odzioba <lukas.odzioba@...il.com>:
> I just compared code on two branches and I don't think that what you
> did is the proper way of doing split kernels...
> I guess it should be clear to see using profiler.
>
>> GCN without "add 0" optimization
>> Device 1: Tahiti [AMD Radeon HD 7900 Series]
>> Many salts:     45093 c/s real, 4915K c/s virtual
>
>> GCN with unrolling one loop
>> Many salts:     27536 c/s real, 3276K c/s virtual
>
> The result you are giving here is for add 0 optimization and 4
> kernels, and I guess the latter is the problem here.
> This really confused me until I noticed a real big difference between
> two branches - not just unrolling one loop.
> Please be more specific in the future, otherwise we will be wasting time.

The speed decreases after changing this loop:

for (int i = 0; i < 16; i++) {
    t1 = k[i] + w[i] + h + Sigma1(e) + Ch(e, f, g);
    t2 = Maj(a, b, c) + Sigma0(a);

    h = g;
    g = f;
    f = e;
    e = d + t1;
    d = c;
    c = b;
    b = a;
    a = t1 + t2;
}

to

for (int i = 0; i < 10; i++) {
    t1 = k[i] + w[i] + h + Sigma1(e) + Ch(e, f, g);
    t2 = Maj(a, b, c) + Sigma0(a);

    h = g;
    g = f;
    f = e;
    e = d + t1;
    d = c;
    c = b;
    b = a;
    a = t1 + t2;
}

for (int i = 10; i < 15; i++) {
    t1 = k[i] + h + Sigma1(e) + Ch(e, f, g);
    t2 = Maj(a, b, c) + Sigma0(a);

    h = g;
    g = f;
    f = e;
    e = d + t1;
    d = c;
    c = b;
    b = a;
    a = t1 + t2;
}

t1 = k[15] + w[15] + h + Sigma1(e) + Ch(e, f, g);
t2 = Maj(a, b, c) + Sigma0(a);

h = g;
g = f;
f = e;
e = d + t1;
d = c;
c = b;
b = a;
a = t1 + t2;

This is in the same branch, parallel_opt, as I told you on IRC,
so it can't be the fault of the split kernels.
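
For reference, here is a minimal host-side C sketch that checks the two loop forms above give the same state when w[10] through w[14] are zero (the "add 0" case). It is not the kernel code: the 64-bit SHA-512-style definitions of Sigma0, Sigma1, Ch and Maj are assumed, the state is kept in an array s[0..7] standing for a..h, and the k, w and initial state values are arbitrary test data produced by a hypothetical helper next_value().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed SHA-512-style round functions on 64-bit words (not copied from the kernel). */
#define ROTR(x, n)   (((x) >> (n)) | ((x) << (64 - (n))))
#define Ch(x, y, z)  (((x) & (y)) ^ (~(x) & (z)))
#define Maj(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define Sigma0(x)    (ROTR(x, 28) ^ ROTR(x, 34) ^ ROTR(x, 39))
#define Sigma1(x)    (ROTR(x, 14) ^ ROTR(x, 18) ^ ROTR(x, 41))

/* One round on s[0..7] = a..h; wi is the message word for this round. */
static void round_step(uint64_t s[8], uint64_t ki, uint64_t wi)
{
    uint64_t t1 = ki + wi + s[7] + Sigma1(s[4]) + Ch(s[4], s[5], s[6]);
    uint64_t t2 = Maj(s[0], s[1], s[2]) + Sigma0(s[0]);

    memmove(&s[1], &s[0], 7 * sizeof s[0]); /* h = g, g = f, ..., b = a */
    s[4] += t1;                             /* e = d + t1 */
    s[0] = t1 + t2;                         /* a = t1 + t2 */
}

/* Original form: one loop over all 16 rounds. */
static void rounds_single(uint64_t s[8], const uint64_t k[16], const uint64_t w[16])
{
    for (int i = 0; i < 16; i++)
        round_step(s, k[i], w[i]);
}

/* Split form: rounds 10..14 skip the addition of w[i], which must be zero. */
static void rounds_split(uint64_t s[8], const uint64_t k[16], const uint64_t w[16])
{
    for (int i = 0; i < 10; i++)
        round_step(s, k[i], w[i]);
    for (int i = 10; i < 15; i++) {
        assert(w[i] == 0);                  /* precondition of the "add 0" shortcut */
        round_step(s, k[i], 0);
    }
    round_step(s, k[15], w[15]);
}

/* Hypothetical helper: arbitrary 64-bit test values, not real constants. */
static uint64_t next_value(uint64_t *x)
{
    *x = *x * 6364136223846793005ULL + 1442695040888963407ULL;
    return *x;
}

/* Returns 1 if both forms produce the same final state, 0 otherwise. */
int check_equivalence(void)
{
    uint64_t k[16], w[16], s1[8], s2[8];
    uint64_t x = 0x9E3779B97F4A7C15ULL;     /* arbitrary seed */

    for (int i = 0; i < 16; i++) {
        k[i] = next_value(&x);
        w[i] = (i >= 10 && i < 15) ? 0 : next_value(&x);
    }
    for (int i = 0; i < 8; i++)
        s1[i] = s2[i] = next_value(&x);

    rounds_single(s1, k, w);
    rounds_split(s2, k, w);
    return memcmp(s1, s2, sizeof s1) == 0;
}
```

Calling check_equivalence() from a small main() should return 1. That only confirms the rewrite is functionally the same under these assumptions; it says nothing about how the OpenCL compiler schedules or unrolls either form.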

Also, if I change the second loop in a similar way, the speed decreases.

first loop:

Symbol table '.symtab' contains 15 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 00000000   186 OBJECT  LOCAL  DEFAULT    5 __OpenCL_compile_options
     2: 00000000   640 OBJECT  LOCAL  DEFAULT    6 __OpenCL_0_global
     3: 00000280   559 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     4: 00000000 40490 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
     5: 000004af    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     6: 000004cf   619 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     7: 00009e2a 25430 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
     8: 0000073a    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     9: 0000075a   633 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
    10: 00010180 43430 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
    11: 000009d3    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
    12: 000009f3   623 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
    13: 0001ab26 43170 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
    14: 00000c62    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_

second loop:

   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 00000000   186 OBJECT  LOCAL  DEFAULT    5 __OpenCL_compile_options
     2: 00000000   640 OBJECT  LOCAL  DEFAULT    6 __OpenCL_0_global
     3: 00000280   559 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     4: 00000000 40490 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
     5: 000004af    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     6: 000004cf   619 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     7: 00009e2a 56618 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
     8: 0000073a    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
     9: 0000075a   633 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
    10: 00017b54 43430 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
    11: 000009d3    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
    12: 000009f3   623 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
    13: 000224fa 43170 FUNC    LOCAL  DEFAULT    7 __OpenCL_parallel_kernel_
    14: 00000c62    32 OBJECT  LOCAL  DEFAULT    6 __OpenCL_parallel_kernel_
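
To make the difference between the two listings easier to see, here is a small C sketch that subtracts the FUNC sizes and offsets. The numbers are copied from symbol table entries 4, 7, 10 and 13 above; the array and function names are only illustrative, and "first"/"second" refer to the listings labelled "first loop" and "second loop".

```c
#include <assert.h>

#define NUM_KERNELS 4

/* FUNC sizes in bytes (entries 4, 7, 10, 13 of each listing). */
static const long size_first[NUM_KERNELS]  = {40490, 25430, 43430, 43170};
static const long size_second[NUM_KERNELS] = {40490, 56618, 43430, 43170};

/* FUNC start offsets (the Value column of the same entries). */
static const long off_first[NUM_KERNELS]  = {0x00000000, 0x00009e2a, 0x00010180, 0x0001ab26};
static const long off_second[NUM_KERNELS] = {0x00000000, 0x00009e2a, 0x00017b54, 0x000224fa};

/* Size change of kernel i between the two listings, in bytes. */
long func_size_delta(int i)
{
    assert(i >= 0 && i < NUM_KERNELS);
    return size_second[i] - size_first[i];
}

/* Offset change of kernel i between the two listings, in bytes. */
long func_offset_delta(int i)
{
    assert(i >= 0 && i < NUM_KERNELS);
    return off_second[i] - off_first[i];
}
```

Only the second FUNC entry changes: it goes from 25430 to 56618 bytes, a difference of 31188 bytes, and the two kernels after it are shifted by the same 31188 bytes. The other three kernels keep their sizes.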