john-dev - Re: Sayantan:Weekly Report #11

Message-ID: <CA+TsHUDpGq2LMETQ2C3Uq7cd4f9VSXNMsy_YxQN02M_HjzyR_w@mail.gmail.com>
Date: Thu, 5 Jul 2012 12:11:14 +0530
From: SAYANTAN DATTA <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Sayantan:Weekly Report #11

On Thu, Jul 5, 2012 at 10:36 AM, Solar Designer <solar@...nwall.com> wrote:

>
> http://openwall.info/wiki/john/development/AMD-IL
>
> (or suggest a more suitable name).  Include the tiny(?) test programs in
> there, command-lines used to compile, sample output.  The style of that
> wiki page can be similar to:
>
> http://openwall.info/wiki/internal/gcc-local-build
>

Okay, I'll do it today.

>
> Is there already a need to integrate the support for this into JtR?
> In other words, what would your next steps be after such integration?
>

No, it is not necessary to integrate the binary generator into JtR.
The IL can be modified outside the JtR environment. However, using the
resulting binaries requires adding a few lines of code to JtR's OpenCL
environment. Only generating the binaries would require libelf-devel;
using them won't require anything additional.
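
As a rough illustration of the "few lines of code" on the consuming side,
here is a minimal sketch in plain C that only loads a prebuilt kernel
binary from disk. The helper name read_kernel_binary and the idea of
keeping the binary in a separate file are assumptions for illustration,
not existing JtR code. In an OpenCL host program, the returned buffer and
its length would then typically be passed to clCreateProgramWithBinary()
in place of the clCreateProgramWithSource() call.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read a whole file into a malloc'd buffer.
 * Returns NULL on failure. On success, *size_out holds the number of
 * bytes read and the caller is responsible for free()ing the buffer.
 */
unsigned char *read_kernel_binary(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf;
    long len;

    if (!f)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0 || (len = ftell(f)) < 0) {
        fclose(f);
        return NULL;
    }
    rewind(f);

    /* Allocate at least one byte so an empty file still yields non-NULL. */
    buf = malloc(len > 0 ? (size_t)len : 1);
    if (!buf) {
        fclose(f);
        return NULL;
    }

    if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        fclose(f);
        return NULL;
    }

    fclose(f);
    *size_out = (size_t)len;
    return buf;
}
```

Nothing here depends on libelf, which matches the point above that only
generating the binaries needs it.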



> Do you know what specific IL-level changes you'd try?
>

First I would try to minimize the number of instructions.


> OK.  Can you try to answer these questions? -
>
> Which execution units and how many per CU does your bf_kernel.cl (as
> released in 1.7.9-jumbo-6) use on 7970?
>

> Does it use scatter/gather addressing?  Or only gather?  Or neither?
>

I will answer them later, after doing some research.


> How much LDS does it actually use?  Does it use any other memory type(s)
> within the inner loop?
>

> Before you started with bf_kernel.cl, I estimated that we might be able
> to use up to 1/4th of 7970's computing resources yet fit in LDS.  Is
> this what your code is trying to use or do you limit it to less, and why?
>
> I recall that you wrote somewhere that you were only able to use 32 KB
> of LDS per CU, not the full 64 KB.  (However, I can't find this now.)
> Do I recall correctly, and if so why is that?
>

According to AMD, there is an upper limit of 32 KB of LDS per work group.
However, this doesn't mean you cannot use the full 64 KB of LDS: you need
to dispatch two or more work groups per CU to do so. In our case the work
group size is limited to 8, so to fully utilize the 64 KB of LDS you need
to dispatch at least 8x2=16 work items per CU, which amounts to 16x32=512
work items globally. To hide ALU latency, one can also increase the number
of work items in multiples of 512. However, the number of in-flight
wavefronts is strictly limited by LDS and GPR usage. Since we are already
using the full 64 KB of LDS by dispatching 512 work items, the number of
additional in-flight wavefronts should be limited to zero. As a result,
there is only a minor increase in performance when the number of global
work items is increased from 512 to 1024. Therefore, your calculation of
utilizing one out of four SIMD units per CU on 7970 is valid for the
current kernel.
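
The arithmetic above can be written out as a small C sketch. The constant
and function names are made up for illustration; the values are the ones
quoted in this message (64 KB LDS per CU, 32 KB per work group, work group
size 8), plus the assumption that the x32 factor is the number of CUs on
the 7970.

```c
/* Figures quoted in the message; the names are illustrative only. */
enum {
    LDS_PER_CU_KB         = 64, /* LDS available per compute unit */
    LDS_PER_WORK_GROUP_KB = 32, /* AMD's upper limit per work group */
    WORK_GROUP_SIZE       = 8,  /* work items per work group for this kernel */
    COMPUTE_UNITS         = 32  /* assumed: CUs on the 7970 (the x32 factor) */
};

/* Work groups needed per CU to cover the whole LDS. */
unsigned work_groups_per_cu(void)
{
    return LDS_PER_CU_KB / LDS_PER_WORK_GROUP_KB;
}

/* Work items per CU when every work group is full. */
unsigned work_items_per_cu(void)
{
    return work_groups_per_cu() * WORK_GROUP_SIZE;
}

/* Minimum global work size that fills the LDS on every CU. */
unsigned min_global_work_items(void)
{
    return work_items_per_cu() * COMPUTE_UNITS;
}
```

With these numbers the functions give 2 work groups per CU, 16 work items
per CU and 512 work items globally, matching the figures above.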


> Besides all of the above, your idea to use local and global memory at
> once is a good one.  If we don't achieve much with other optimizations,
> perhaps we'll achieve something like 7000 c/s on 7970 by adding some use
> of global memory in parallel with the local memory uses currently in
> your kernel.
>

I once tried running two john builds together on 7970, one using LDS and
the other using global memory, but it caused an ASIC hang. Maybe we need to
merge them into one build, but how to merge them remains a question. One
of the two possible ways is to make two clEnqueueNDRangeKernel calls per
crypt. The other is to merge the two kernels into one. Also, how the two
branches get scheduled on the GPU will impact performance.

Regards,
Sayantan
