john-dev - Re: Discuss Mask-mode with GPUs and changes required for its support.

Message-ID: <9f5cee50ac47bf1b5114b10b63b3a6a7@smtp.hushmail.com>
Date: Wed, 28 Jan 2015 09:36:50 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Discuss Mask-mode with GPUs and changes required for
 its support.

On 2015-01-27 21:49, Sayantan Datta wrote:
> On Wed, Nov 12, 2014 at 5:35 AM, magnum <john.magnum@...hmail.com> wrote:
> 
>> On 2014-11-05 16:02, Sayantan Datta wrote:
>>
>>> Based on my earlier experience with mask-mode, it was necessary to write
>>> separate kernels for benchmark, mask-mode (password on GPU) and non-mask
>>> modes. However, as much as I would love to unify them under a common
>>> kernel, with all of them following the same code path, it is difficult to
>>> do so without making debatable changes.
>>>
>>
>> Btw here's some food for thought:
>>
>> The NT kernel could either (among alternatives):
>> a) Only support ISO-8859-1 (like Hashcat).
>> b) Transfer base words in UTF-16 to GPU, and use a UTF-16 version of
>> GPU-side mask generation.
>> c) Support UTF-8/codepage conversions on GPU (NTLMv2 and krb5pa-md5
>> kernels currently do this). So we transfer base words in UTF-8 or a CP to
>> GPU, apply the mask and finally convert to UTF-16 on GPU.
>> d) some combination of b and c. For example, transfer basewords to GPU in
>> UTF-8/CP, then convert them to UTF-16 once, finally apply mask with a
>> UTF-16 version of mask mode.
>>
>> IMHO we should *definitely* have full UTF-8/codepage support, the question
>> is how. We will never be quite as fast as Hashcat with NT hashes anyway so
>> we should beat it with functionality. So in my book, option a is totally
>> out of the question.
>>
>> Option b is simplest but typically needs twice the bandwidth for PCI
>> transfers (which is not much of a problem when we run hybrid mask) while
>> option c needs somewhat more complex GPU code. I guess option b is
>> typically fastest for mask mode. However, option c is fastest when not
>> using a mask.
>>
>> magnum
>>
>>
> Regarding bandwidth, I don't understand how transferring UTF16 words as
> UTF8 would save PCIe bandwidth. If I am correct, UTF16 can support 65536
> values while UTF8 supports only up to 255. So any character with unsigned
> value greater than 255 must be sent as two UTF8 words or a single UTF16
> word, and both choices should require the same bandwidth. So I believe we are
> only talking about UTF16 words with values <= 255.

Right, and worse: any character >= 128 (that is, anything outside ASCII)
takes two or more bytes in UTF-8. So sometimes UTF-16 and UTF-8 will be
the same size, and sometimes UTF-8 will take even more (this should be
rare though). However, only 0.16% of the "Rockyou" words include *any*
non-ASCII character, so UTF-8 should be the best bet (unless GPU
conversion is more expensive than transfer).
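
To make the size comparison concrete, here is a minimal Python sketch
(an illustration only, not code from JtR) that counts the bytes a
candidate occupies in UTF-8 versus UTF-16. The helper name
`encoded_sizes` and the sample words are assumptions picked for the
example.

```python
def encoded_sizes(word):
    """Return (utf8_bytes, utf16_bytes) for a word, with no BOM."""
    return len(word.encode("utf-8")), len(word.encode("utf-16-le"))


# Hypothetical sample candidates: pure ASCII, Latin-1 range, and CJK.
samples = ["password", "müller", "日本語"]

for word in samples:
    utf8_len, utf16_len = encoded_sizes(word)
    print(f"{word!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Pure ASCII "password" is 8 bytes in UTF-8 against 16 in UTF-16, while
"日本語" is 9 bytes in UTF-8 against 6 in UTF-16, which is the rare case
where UTF-8 costs more.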

Another thing we can do (currently implemented in ntlmv2-opencl,
krb5pa-md5-opencl and oldoffice-opencl) is transfer a codepage (always
8-bit) and convert to UTF-16 on GPU. These three formats can take
codepage *or* UTF-8 depending on your needs, and will convert to UTF-16
on GPU side.
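
As a rough sketch of the idea (in Python rather than OpenCL, and not
the actual kernel code), converting a single-byte codepage to UTF-16
can be done with a 256-entry lookup table. The names `build_table` and
`codepage_to_utf16le` are hypothetical, and the sketch assumes every
character of the codepage fits in one 16-bit code unit, which holds for
CP1252.

```python
import struct


def build_table(codepage="cp1252"):
    """Map each of the 256 byte values to a 16-bit code unit.

    Bytes the codepage leaves undefined become U+FFFD.
    """
    return [ord(bytes([b]).decode(codepage, errors="replace"))
            for b in range(256)]


CP1252_TABLE = build_table()


def codepage_to_utf16le(data, table=CP1252_TABLE):
    """Widen 8-bit codepage bytes to little-endian UTF-16 via the table."""
    units = [table[b] for b in data]
    return struct.pack("<%dH" % len(units), *units)


# The 8-bit input is half the size of the UTF-16 output.
narrow = "café".encode("cp1252")
wide = codepage_to_utf16le(narrow)
print(len(narrow), len(wide))
```

The table is small enough to live in constant memory on a GPU, so the
per-character work is one load and one store.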

> Also, does our mask mode support UTF16 placeholders? Because internally the
> design only supports 8bit characters. Are we somehow converting UTF16
> placeholders into UTF8 placeholders?

Current mask-mode can use an internal encoding (not UTF-8 but some 8-bit
codepage the user deems suitable for the job, e.g. CP1252). If conversion
takes place on GPU (as in the formats mentioned above), we send CP1252
over PCIe, which is always half the size of UTF-16. But again this is
not necessarily the best alternative for GPU-side mask mode.

> Currently, what we are doing is splitting the mask into two parts, one part
> generates the template candidates (by mask.c) while the other part generates
> the values to plug into the template (by mask_ext.c). Both activities are
> performed on CPU, but the actual plugging takes place inside GPU. The
> advantage is less complex kernel code which enables us to write unified
> kernel for mask mode, self-test and all other modes with least branch
> instructions. For UTF16 words with values <=255, it is better to generate
> the template candidates as UTF8 while the values to plug into the template
> as UTF16. The reason being we could use parallelism to convert UTF8 into
> UTF16 for template candidates while for the plug in values, to avoid
> redundant work it is better to do the conversion on CPU. I think this is
> what you have suggested in option 'd'.

Sounds good to me. Those 8-bit "codepage" placeholders mentioned above
could be converted to UTF-16 at this point and the kernel could be
UTF-16 only. For mask mode, that is. For non-mask mode (e.g. pure
wordlist) we probably want some formats (e.g. ntlmv2) to still be able
to convert to UTF-16 on GPU.
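
A minimal Python sketch of that split, under stated assumptions: the
template arrives in an 8-bit codepage and is widened per candidate (the
part a GPU would do in parallel), while the plug-in values are widened
once up front (the part done on CPU). The function names, the
`positions` list and the CP1252 default are all hypothetical and do not
reflect the actual mask.c / mask_ext.c interfaces.

```python
from itertools import product


def to_utf16_units(data, codepage="cp1252"):
    """Widen 8-bit codepage bytes to a list of 16-bit code units."""
    return [ord(ch) for ch in data.decode(codepage)]


def expand(template, positions, charsets, codepage="cp1252"):
    """Yield candidates by plugging charset values into a template.

    template  -- 8-bit template candidate (placeholder bytes are overwritten)
    positions -- indexes in the template to fill
    charsets  -- one 8-bit charset per position
    """
    # "CPU side": convert the plug-in values once, not once per candidate.
    plug = [to_utf16_units(cs, codepage) for cs in charsets]
    # "GPU side": widen the template, then plug in ready-made UTF-16 units.
    base = to_utf16_units(template, codepage)
    for combo in product(*plug):
        candidate = list(base)
        for pos, unit in zip(positions, combo):
            candidate[pos] = unit
        yield "".join(map(chr, candidate))


for candidate in expand(b"pass??", [4, 5], [b"ab", b"12"]):
    print(candidate)
```

The point of the split is that the plug-in values are shared by every
template, so converting them per candidate would be redundant work.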

magnum