Date: Fri, 22 May 2015 22:25:45 +0200
From: magnum <john.magnum@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Bleeding jumbo now defaults to UTF-8

On 2015-05-22 21:17, Marek Wrzosek wrote:
> On 22.05.2015 at 18:33, magnum wrote:
>> On 2015-05-22 16:48, Marek Wrzosek wrote:
>>> What is the simplest way to "repair" all.lst from Openwall?
>>
>> I bet it's a mix of encodings so can't simply be converted. No tool in
>> the world will correctly guess each individual line's encoding (I have
>> seen tools that try, but never one that was any good at it).
>>
>> But all.lst is just a mix of all the separate smaller files. Ideally
>> each of them should be converted to UTF-8 (from whatever respective
>> codepage), and a new all_utf8.lst could then be created from this.
>
> I've already created something like this using latin1, koi8-r and
> cp1251, but the latter two are Russian-only, so after unique only one
> of them remains. I also created a file ru_all.lst_utf8 with Russian-only
> passwords (for use with e.g. --rules=jumbo).
> It's against netiquette to attach such big files to e-mails so here are
> links:
> https://dl.dropboxusercontent.com/u/68111957/all.lst_utf8.gz
> https://dl.dropboxusercontent.com/u/68111957/ru_all.lst_utf8.gz
>
> I hope they are fine.

Cool. However, all.lst_utf8 looks OK at first glance but contains half a
million lines of double-encoded Unicode. I should probably mention there
is a tool in Jumbo, cprepair, that has some good heuristics for fixing 
that very problem and some others. I usually don't talk about it because 
I haven't had the inspiration to document it :-)
"../run/cprepair -h" will show usage though.
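
For anyone unsure what "double-encoded" means here: text that was already
UTF-8 got treated as Latin-1 and encoded to UTF-8 a second time. The Python
sketch below shows only that basic idea and one way to undo it. It is not
cprepair's actual algorithm, the function name repair_double_utf8 is made up
for illustration, and it assumes the wrong intermediate codepage was Latin-1
(in practice it is often CP1252 or something else).

```python
def repair_double_utf8(line: str) -> str:
    """Undo one round of Latin-1-then-UTF-8 double encoding, if present.

    Illustrative sketch only; assumes the bogus intermediate codepage
    was Latin-1. Lines that do not look double-encoded are returned as-is.
    """
    try:
        # Mapping each character back to a Latin-1 byte recovers the
        # original UTF-8 byte sequence, which is then decoded properly.
        return line.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # the line was most likely fine to begin with.
        return line


if __name__ == "__main__":
    good = "abergläubischen"
    # Simulate the damage: UTF-8 bytes misread as Latin-1.
    broken = good.encode("utf-8").decode("latin-1")
    print(repr(broken), "=>", repr(repair_double_utf8(broken)))
```

Pure ASCII lines and correctly encoded lines pass through unchanged. A line
that legitimately contains a sequence such as "Ã" followed by "¤" would be
altered by this sketch, which is the kind of case where real heuristics are
needed.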

Check files (no output from "-s -d" means they seem to be fine):
$ ../run/cprepair -s -d ru_all.lst_utf8
filename: ru_all.lst_utf8

$ ../run/cprepair -s -d all.lst_utf8 | head
filename: all.lst_utf8
abergläubischen => abergläubischen
abfällt => abfällt
abgeändert => abgeändert
abgeänderte => abgeänderte
abgeänderten => abgeänderten
abgehängten => abgehängten
abgeklärt => abgeklärt
abgekürzt => abgekürzt
abgelöscht => abgelöscht

Fix the latter:
$ ../run/cprepair all.lst_utf8 | ../run/unique all2.lst_utf8
Total lines read 4917041 Unique lines written 4435306
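
The per-file approach discussed in the quoted text above (decode each
smaller wordlist from its own codepage, then build one UTF-8 list without
duplicates) can be sketched roughly as follows in Python. This is only an
illustration under assumptions: the function name merge_to_utf8 and the
sample words are hypothetical, the codepage of each source has to be known
in advance, and it is not how the unique tool is implemented.

```python
def merge_to_utf8(sources):
    """Merge wordlists given as (raw_bytes, codec) pairs into UTF-8 bytes.

    Duplicates are dropped and first-seen order is kept. Each codec must
    be the real codepage of its wordlist; nothing is guessed here.
    """
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for raw, codec in sources:
        # Split on the newline byte before decoding, so codepage bytes that
        # decode to Unicode line separators cannot split a word.
        for raw_line in raw.split(b"\n"):
            word = raw_line.rstrip(b"\r").decode(codec)
            if word:
                seen.setdefault(word, None)
    return b"".join(word.encode("utf-8") + b"\n" for word in seen)


if __name__ == "__main__":
    sources = [
        ("пароль\nпривет\n".encode("koi8_r"), "koi8_r"),
        ("привет\nпароль\nмир\n".encode("cp1251"), "cp1251"),
        ("abgeändert\n".encode("latin-1"), "latin-1"),
    ]
    print(merge_to_utf8(sources).decode("utf-8"), end="")
```

Note how the same Russian words coming from the koi8-r and cp1251 sources
end up as a single entry each once everything is in UTF-8.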

magnum
