john-dev - Re: ISO-8859-1 casing (experimental)

Message-ID: <6FAA055EEF2E42F0ACD878267ECF0850@ath64dual>
Date: Sat, 23 Jul 2011 18:08:18 -0500
From: "JFoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Re: ISO-8859-1 casing (experimental)

----- Original Message ----- 
From: "Solar Designer" <solar@...nwall.com>
>
> So I shouldn't merge your john-1.7.8-jumbo-2-iso_8859_1_tst-1, right?

Not yet.  I am getting close.  I have made numerous other fixes and am 
working on the last one: getting SSE working again for x86-64 builds in 
md5-gen.  I do have a workaround for md5-gen, but it simply turns off SSE 
for x86-64 builds.  With that workaround, a user has no easy way to 'know' 
that on that platform certain formats have to be run with a flag turning 
off SSE, and that makes doing md5-gen in john.conf hard.

As for md5-gen, I have:
- Full Unicode support.
- Forced upcase/lowcase password handling.
- There was uc/lc of the username, but only uc 'worked'.  That is now also 
'fixed', and uses unicode.c.
- The ability to load test hashes only when -utf8 is used, or only when 
-utf8 is NOT used.  This allows building a utf8 'capable' format that works 
in iso-8859-1 mode or in utf-8 mode, depending upon the command line, much 
like what was done in the init of many other utf8-capable formats.
- FMT_UTF8 and FMT_UNICODE are now properly set in the options flags when 
they 'should' be set (during init of the format structures).
- Added a raw-md5-unicode format into md5-gen (not optimal at this time).

>> I also changed
>> --create-charset=fname to --create-incremental=fname (and tried to change
>> the usage of charset in the documentation to 'follow'), to avoid 
>> confusion.
>
> I don't understand what you mean by "follow".

I meant that I changed the documentation: where it mentioned a 'charset 
file', I changed that to 'incremental file'.  Often, when documenting the 
create option, you used 'charset' or 'charset file', but then talked about 
the incremental run when documenting the --incr= mode.  I changed most of 
those uses of 'charset' (I may have missed some, but I tried to get them 
all) to avoid confusion.

> BTW, the "casing" for iso-8859-1 is a subset of that for koi8-r (or it
> could be the other way around historically),

iso-8859-1 is a subset of many charsets.  It also matches the first 256 
code points of Unicode proper.  There will likely be much charset 
functionality we can add (where we need it), based upon the 8859-1 code. 
Dealing with fixed-size charsets like 8859-1 is much easier than 
variable-size ones such as utf-8.
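
To make the fixed vs variable size point concrete, here is a small Python 
sketch (illustration only, not JtR code; the variable names are mine).  It 
checks that every iso-8859-1 byte decodes to the Unicode code point with the 
same value, and compares encoded lengths for a sample word.

```python
# Every ISO-8859-1 byte value decodes to the Unicode code point
# with the same number, so byte index == character index.
latin1_bytes = bytes(range(256))
latin1_text = latin1_bytes.decode("iso-8859-1")
same_code_points = all(ord(ch) == b for ch, b in zip(latin1_text, latin1_bytes))

# The same word is 1 byte per character in ISO-8859-1, but UTF-8
# needs 2 bytes for the non-ASCII character.
word = "stra\u00dfe"  # 6 characters, one of them is 0xDF in ISO-8859-1
latin1_len = len(word.encode("iso-8859-1"))
utf8_len = len(word.encode("utf-8"))

print(same_code_points, latin1_len, utf8_len)
```

With a fixed-size charset, a 256-entry table indexed by byte value is enough 
for casing; with utf-8 the code first has to find character boundaries.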

Is koi8-r a fixed 8-bit charset?  I thought so but was not sure, so I 
looked up the wiki on it: it is fixed.  It looks like koi8-r cases three 
pairs that iso-8859-1 does not: A3<->B3, D7<->F7 and DF<->FF.  Those are 
the only 3 'differences' in casing between iso-8859-1 and koi8-r.  In 
iso-8859-1, D7 and F7 are not a case pair, and DF upcases to 'S''S' (but 
there is no lower case of SS back to xDF).  That 1 char to 2 char rule is 
not handled in rules.c.
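
The comparison above can be double-checked with a short Python sketch.  This 
is only an illustration that leans on Python's own codecs and Unicode case 
mappings; case_pairs() is a name I made up, and nothing here comes from 
rules.c.  It collects the byte pairs in 0x80-0xFF that are upper/lower case 
partners within one 8-bit charset, skipping mappings that leave the charset 
or produce more than one character (such as DF -> 'SS').

```python
def case_pairs(encoding):
    """Return the set of {byte, byte} case pairs in 0x80-0xFF for an 8-bit charset."""
    pairs = set()
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this charset
        for other in (ch.upper(), ch.lower()):
            # Skip "no change" and 1-char to 2-char mappings like sharp s -> "SS".
            if other == ch or len(other) != 1:
                continue
            try:
                ob = other.encode(encoding)
            except UnicodeEncodeError:
                continue  # the case partner does not exist in this charset
            pairs.add(frozenset((b, ob[0])))
    return pairs

latin1 = case_pairs("iso-8859-1")
koi8r = case_pairs("koi8_r")

common = latin1 & koi8r
only_koi8r = sorted(tuple(sorted(p)) for p in koi8r - latin1)
only_latin1 = sorted(tuple(sorted(p)) for p in latin1 - koi8r)

print(len(common), [(hex(a), hex(b)) for a, b in only_koi8r], only_latin1)
```

Run this way, the two charsets share 30 byte pairs, and koi8-r has exactly 
three more: A3/B3, D7/F7 and DF/FF.  Note the sketch compares which bytes are 
paired, not which member of each pair is the upper case one.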

So, getting koi8-r into rules (for casing) would be trivial.  Simply add 
another enum value to the --charset=X processing, add another var to the 
options (the enum values get converted and set one of those vars), then in 
the rules initialization code, have a switch that loads the proper data. 
Very easy to do.
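
A rough sketch of those steps, in Python rather than C to keep it short. 
Every name here (Charset, parse_charset_option, init_case_tables) is 
hypothetical and is not taken from the JtR source; the real code would fill 
C arrays during rules init instead.

```python
from enum import Enum

class Charset(Enum):
    """Hypothetical values accepted by a --charset=X style option."""
    ASCII = "ascii"
    ISO_8859_1 = "iso-8859-1"
    KOI8_R = "koi8-r"

def parse_charset_option(value):
    """Convert the option string into the enum (raises ValueError if unknown)."""
    return Charset(value.lower())

def init_case_tables(charset):
    """Build 256-entry upcase/lowcase tables for the selected charset."""
    upper = list(range(256))
    lower = list(range(256))
    for c in range(ord("a"), ord("z") + 1):  # ASCII part, common to all
        upper[c] = c - 0x20
        lower[c - 0x20] = c
    if charset is Charset.ISO_8859_1:
        # Upper case at C0-DE, lower case 0x20 above; D7/F7 are not letters.
        for up in range(0xC0, 0xDF):
            if up != 0xD7:
                lower[up] = up + 0x20
                upper[up + 0x20] = up
    elif charset is Charset.KOI8_R:
        # Lower case at C0-DF, upper case 0x20 above; plus the A3/B3 pair.
        for lo in range(0xC0, 0xE0):
            upper[lo] = lo + 0x20
            lower[lo + 0x20] = lo
        upper[0xA3] = 0xB3
        lower[0xB3] = 0xA3
    return bytes(upper), bytes(lower)

# Usage: upcase a koi8-r encoded word with one table lookup per byte.
up, low = init_case_tables(parse_charset_option("koi8-r"))
print("\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r").translate(up).decode("koi8_r"))
```

Adding one more 8-bit charset is then one more enum value and one more 
branch that fills the tables.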

Likely there will be many more where we can perform 'simple' 8 bit casing 
within the rules, if the user is presenting john with a dict file made in a 
specific charset.

> missing only 3 rarely used
> Cyrillic letters.  That is, the letters are indeed different, but the
> needed transformation for the 30 pairs of 8-bit codes common to both of
> these encodings is exactly the same - and this is what matters for JtR.
>
> There are other Cyrillic encodings in use as well, though, for which the
> same would not hold true.  Besides koi8-r (formerly the most common
> Cyrillic encoding on Unix-like systems), the currently relevant ones are
> cp1251 (Windows) and indeed utf-8 (which is probably more popular than
> koi8-r on Unix-like systems by now).