Date: Fri, 22 May 2015 23:19:46 +0200
From: Marek Wrzosek <marek.wrzosek@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: Bleeding jumbo now defaults to UTF-8

On 22.05.2015 at 22:25, magnum wrote:
> On 2015-05-22 21:17, Marek Wrzosek wrote:
>> On 22.05.2015 at 18:33, magnum wrote:
>>> On 2015-05-22 16:48, Marek Wrzosek wrote:
>>>> What is the simplest way to "repair" all.lst from Openwall?
>>>
>>> I bet it's a mix of encodings, so it can't simply be converted. No tool in
>>> the world will correctly guess each individual line's encoding (I have
>>> seen tools that try, but never one that was any good at it).
>>>
>>> But all.lst is just a mix of all the separate smaller files. Ideally
>>> each of them should be converted to UTF-8 (from whatever respective
>>> codepage), and a new all_utf8.lst could then be created from this.
>>
>> I've already created something like this using latin1, koi8-r and
>> cp1251, but the latter two are Russian-only, so after running unique
>> only one of them remains. I also created the file ru_all.lst_utf8 with
>> Russian-only passwords (for use with e.g. --rules=jumbo).
>> It's against netiquette to attach such big files to e-mails so here are
>> links:
>> https://dl.dropboxusercontent.com/u/68111957/all.lst_utf8.gz
>> https://dl.dropboxusercontent.com/u/68111957/ru_all.lst_utf8.gz
>>
>> I hope they are fine.
> 
> Cool. However, all.lst_utf8 looks OK at first but contains half a
> million lines of double-encoded Unicode. I should probably mention there
> is a tool in Jumbo, cprepair, that has some good heuristics for fixing
> that very problem and some others. I usually don't talk about it because
> I haven't had the inspiration to document it :-)
> "../run/cprepair -h" will show usage though.
> 
> Check files (no output from "-s -d" means they seem to be fine):
> $ ../run/cprepair -s -d ru_all.lst_utf8
> filename: ru_all.lst_utf8
> 
> $ ../run/cprepair -s -d all.lst_utf8 | head
> filename: all.lst_utf8
> abergläubischen => abergläubischen
> abfällt => abfällt
> abgeändert => abgeändert
> abgeänderte => abgeänderte
> abgeänderten => abgeänderten
> abgehängten => abgehängten
> abgeklärt => abgeklärt
> abgekürzt => abgekürzt
> abgelöscht => abgelöscht
> 
> Fix the latter:
> $ ../run/cprepair all.lst_utf8 | ../run/unique all2.lst_utf8
> Total lines read 4917041 Unique lines written 4435306
> 
> magnum
> 
Thank you, magnum! :-)
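
For anyone reading this later and wondering what "double-encoded" means
here: a rough Python sketch of the problem and of one naive way to undo it
is below. It assumes the extra encoding round went through Latin-1 (it
could just as well have been cp1252 or something else), and the function
names double_encode and repair_line are made up for this illustration.
It is not how cprepair works internally; cprepair uses its own heuristics.

```python
def double_encode(text: str) -> str:
    """Simulate the damage: the UTF-8 bytes of `text` are misread as
    Latin-1, so every non-ASCII character turns into two characters."""
    return text.encode("utf-8").decode("latin-1")


def repair_line(line: str) -> str:
    """Undo one round of double encoding when that is possible.

    Assumption: the wrong intermediate codepage was Latin-1. If the line
    cannot be mapped back to bytes, or those bytes are not valid UTF-8,
    the line is treated as fine and returned unchanged.
    """
    try:
        return line.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return line


if __name__ == "__main__":
    samples = ["abgeändert", double_encode("abgeändert"), "пароль", "password"]
    for word in samples:
        print(repr(word), "=>", repr(repair_line(word)))
```

Correctly encoded words pass through untouched because their Latin-1 bytes
do not form valid UTF-8 (or, for Cyrillic, do not fit in Latin-1 at all).
A word that legitimately contains a sequence such as "Ã¤" would be
changed by mistake, which is why a real tool needs better heuristics.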
-- 
Marek Wrzosek
marek.wrzosek@...il.com
