john-users - Re: Bleeding jumbo now defaults to UTF-8



Message-ID: <55B12344.5010109@gmail.com>
Date: Thu, 23 Jul 2015 19:24:20 +0200
From: Marek Wrzosek <marek.wrzosek@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: Bleeding jumbo now defaults to UTF-8

On 22.07.2015 at 18:23, magnum wrote:
> On 2015-07-22 16:34, Marek Wrzosek wrote:
>> What is the one proper way to use --inc=utf8 in the new bleeding-jumbo?
>> I mean, which encoding option should we use: --input-encoding=utf-8,
>> --target-encoding=utf-8, --internal-encoding=utf-8 or just
>> --encoding=utf-8? None of them seems to work with --inc=utf8.
>> For --inc=latin1, --target-encoding=cp1252 is mandatory for the pot
>> file to be utf-8 only and not mixed with other encodings.
> 
> The thing that mandates what encoding to use is what actual encoding was
> used by the system producing the hashes in the first place. If it's
> UCS-2/UTF-16 (eg. NT or MSSQL) you can use any encoding but if not, you
> *need* to tell JtR about what -target-enc to use (unless it's your
> default).
> 
> After the above is established: Will you give your input in some *other*
> encoding than your target (or default) encoding? In case of incremental
> mode that would not make any sense: You must use an incremental mode
> that corresponds with your encoding (any other approach would be slow).
> So instead of -target-encoding, just use -enc (a.k.a. -input-enc)
> and do not specify any -target-enc (or set it to the same, that's the
> default).
> 
> Now, if you targeted old web hashes and picked -enc=latin1, you can use
> -inc=latin1. The default is -inc=ascii so it will always work, but
> things like "-inc=utf8 -enc=latin1" will definitely produce garbage.
> 
> -internal-encoding does not apply to incremental mode. It's only used in
> case of "utf8 wordlist -> rules -> utf8/16 hashes" and for "mask mode ->
> utf8/16 hashes" (if your mask contains non-ascii).
> 
>> PS. Without any encoding options there are characters that are not from
>> utf-8. The same with --enc=raw. Is there a bug with utf8 incremental
>> mode after defaulting to utf-8?
> 
> Incremental mode was not written with multi-byte charsets like UTF-8 in
> mind, so will sometimes produce some worthless invalid characters. You
> can add "-ext:filter_utf8" to filter them out but for fast formats it's
> better to just ignore them: The filter is much slower than the waste it
> mitigates.
> 
> magnum
> 
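To make concrete what "invalid characters" means here: a generator that
picks bytes one at a time can emit a lead byte without its continuation
bytes, or a continuation byte on its own. Below is a minimal sketch in C
of the kind of well-formedness check a filter would have to run on every
candidate. This is NOT the filter_utf8 external from john.conf; the
function name is_valid_utf8 is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of JtR): returns 1 if buf[0..len) is
 * well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need, k;   /* need = number of continuation bytes */
        uint32_t cp, min; /* decoded code point, smallest legal value */

        if (c < 0x80) {   /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;     /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;     /* sequence is cut short */

        for (k = 1; k <= need; k++) {
            unsigned char t = buf[i + k];
            if ((t & 0xC0) != 0x80)
                return 0; /* expected a 10xxxxxx continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }

        if (cp < min)
            return 0;     /* overlong encoding */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;     /* out of range or a surrogate */

        i += need + 1;
    }
    return 1;
}
```

Even this small check touches every byte of every candidate, which fits
magnum's point that filtering can cost more than the wasted guesses for
fast formats.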
So -inc=utf8 produces garbage because incremental mode isn't designed
for multi-byte charsets like UTF-8. Why is utf8.chr available then? What
is the format of a .chr file? Is it possible to adapt the current
incremental mode code to generate Unicode characters from plane 0 as
UTF-16 code units (e.g. using short instead of char)? If so, producing
UTF-8 from them should be simple: just encode each code unit (using bit
fields or bit shifts, whichever is faster).
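
To show what I mean by the encoding step, here is a rough sketch in C
using bit shifts. It only covers plane 0, takes one 16-bit code unit at a
time, and rejects surrogates since they are not characters on their own.
The function name bmp_to_utf8 is my own invention, not anything that
exists in the JtR sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one plane-0 (BMP) code unit as UTF-8.
 * Writes 1 to 3 bytes into out and returns how many were written,
 * or 0 if cu is a surrogate (0xD800..0xDFFF). */
size_t bmp_to_utf8(uint16_t cu, unsigned char out[3])
{
    if (cu < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cu;
        return 1;
    }
    if (cu < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cu >> 6));
        out[1] = (unsigned char)(0x80 | (cu & 0x3F));
        return 2;
    }
    if (cu >= 0xD800 && cu <= 0xDFFF)
        return 0;                    /* lone surrogate, not encodable */

    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cu >> 12));
    out[1] = (unsigned char)(0x80 | ((cu >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cu & 0x3F));
    return 3;
}
```

For example, 0x00E9 (e with acute) should come out as the two bytes
C3 A9, and 0x20AC (euro sign) as E2 82 AC.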
-- 
Marek Wrzosek
marek.wrzosek@...il.com
