Date: Tue, 3 Nov 2020 19:26:36 +0100
From: magnum <john.magnum@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Rules characters unicode support.

On 2020-11-03 15:48, François wrote:
> While running my tool on a very large (and old) leak, I realized that some
> character substitutions from ASCII to Unicode were hitting some results (a
> few hits on a large leak) for example:
> seé
> (...)
> They're making sense, because some old RFC or specs prevent non ASCII
> characters to be used in email address or login information but passwords
> fields actually take them now. For example, we could imagine that a
> password associated to my email address francois.pesce@...il.com could be
> close to the way my French first name is actually written, thus "françois"
> (possibly generated by a single rule substituting c to ç such as: scç ).
>
> However, it seems that currently, john(-jumbo) does not support Unicode
> characters for all rules commands (except for the content of command A"..."
> ). Is anyone working on supporting that use case, should I just try to use
> the A"..." command for my niche finding ? What are your thoughts?

While the Unicode support could be better, there are ways to achieve what you need.

First of all, we need to tell John what encoding we're expecting the hashes to be made from. Nowadays that's usually a no-brainer: it is usually UTF-8, and that's also the default in john.conf.

Now, if you had needed e.g. CP1252, things would be simpler, since such legacy codepages are all single-byte: you'd simply write your rules such as scç and then be sure to save that config file with CP1252 encoding. Run with --encoding=cp1252 and all should work just fine.

With UTF-8, however, things currently aren't quite that easy, because the rule engine does not (yet) honor multi-byte characters. But we have a workaround called --internal-codepage.
With --internal-codepage, we still expect UTF-8 input (the hash file, any wordlists) and we still produce hashes from UTF-8 encoded cleartext, but internally, within the rule engine, we run the legacy codepage you picked. Just pick any encoding that can hold all characters you need to use.

So let's try it out:

$ echo francois > words.lst
$ cat john-local.conf
[List.Rules:subs]
seé
suü
scç
$ ./john -stdout -w:words.lst -rules=subs -internal-codepage=cp1252
Invalid rule in (null) at line 2: Unknown command seé

We get this error because john-local.conf contains UTF-8. John should actually be smarter here and handle that, but it does not yet. So let's encode our config file in CP1252 instead:

$ mv john-local.conf john-local.utf8
$ iconv -t cp1252 < john-local.utf8 > john-local.conf
$ ./john -stdout -w:words.lst -rules=subs -internal-codepage=cp1252
francois
françois

Another way of achieving the same is to use \xHH hex encoding. The value for "ç" in CP1252 is \xe7, so you'd just write it as sc\xe7 instead of scç. This way there's less risk of your editor messing things up in the future. This notation can also be handy when specifying a rule directly on the command line, like so:

$ ./john -stdout -w:words.lst -internal-c=cp1252 -ru=':sc\xe7 u'
FRANÇOIS

As you can see, once you run with an internal codepage, things like case-shifting (and nearly all other commands and character classes) will work for non-ASCII letters as well. We wouldn't want that to end up as FRANçOIS.

A final note: you can set DefaultInternalCodepage in your config file, saving you from giving the -internal-codepage option every time. I'd actually recommend doing so; the default is empty for backwards compatibility only.

magnum
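
An aside on the \xHH notation: if you would rather not look up the byte values by hand, a few lines of Python can produce them. This is only a sketch built on Python's standard codecs, not anything that ships with John. The helper names rule_escape and substitute_rule are made up for illustration, and the sketch assumes the character being replaced is plain ASCII.

```python
# Illustrative sketch only: uses Python's built-in codecs, not John's code.
# rule_escape and substitute_rule are hypothetical helper names.


def rule_escape(ch: str, codepage: str = "cp1252") -> str:
    """Return the hex-escaped form of one character in a single-byte codepage."""
    # encode() raises UnicodeEncodeError if the codepage cannot hold ch.
    raw = ch.encode(codepage)
    if len(raw) != 1:
        raise ValueError(f"{ch!r} is not a single byte in {codepage}")
    return f"\\x{raw[0]:02x}"


def substitute_rule(src: str, dst: str, codepage: str = "cp1252") -> str:
    """Build an 's' (substitute) rule string; src is assumed to be ASCII."""
    return f"s{src}{rule_escape(dst, codepage)}"


if __name__ == "__main__":
    # The same character is two bytes in UTF-8 but one byte in CP1252.
    print(len("ç".encode("utf-8")))   # 2
    print(len("ç".encode("cp1252")))  # 1

    # Hex-escaped equivalents of the three rules in the example config.
    for src, dst in [("e", "é"), ("u", "ü"), ("c", "ç")]:
        print(substitute_rule(src, dst))  # se\xe9, su\xfc, sc\xe7
```

Rules written this way are pure ASCII, so the encoding the config file is saved in no longer matters for them.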