john-dev - memory usage question, caused by new Unicode-casing


Message-ID: <E51DC23D450A4D6D836D04B6DD43624D@ath64dual>
Date: Fri, 15 Jul 2011 10:35:16 -0500
From: "JFoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: memory usage question, caused by new Unicode-casing

This question is mostly directed at Solar, but I am posting this to the 
list.

With the new unicode casing, there are 2 somewhat large arrays which are 
placed in the heap in john.  I know we have the --save-memory flag within 
john.  I am not overly familiar with the what/when/why about this flag, but 
I wanted to toss this out.

The current implementation of the unicode casing is this:

These are globals in unicode.c:

UTF16 utc2_upcase[0x10000];
UTF16 utc2_downcase[0x10000];

They are global rather than static because a couple of macros (inlines) in 
unicode.h access them, and those macros can be used anywhere in john code.

Ok, a 30 second overview of how they are used, and how they work:

There is a header file (UnicodeData.h), which contains the data to fill in 
these arrays, plus an array of data structures used for the 1-to-many 
character conversions.  An init function, initUnicodeCase(), is called 
during john_init().  It takes the data from UnicodeData.h and fills out the 
utc2_up[down]case arrays, building them into proper 'sparse' arrays.  Each 
element holds the Unicode character that its index converts to when you 
up[down]case it (almost all characters have no case).  An element may 
instead flag that the character does up[down]case, BUT to multiple 
characters.  To *case a single character, you simply look it up in the 
utc2_*case[] array.  So, for the letter 'q', you look at 
utc2_upcase[(UTF16)'q'] and find that it upcases to (UTF16)'Q'.  If you look 
up a character and the array element holds a 1 (binary 1), then this 
character DOES have an up[down]case, but it requires multiple Unicode 
characters, and the conversion is found in the 1-to-many array.  In that 
case the 1-to-many array is walked until the proper character is found, 
which gives the 2 or 3 Unicode characters making up the up[down]cased form. 
The 1-to-many array is not sparse like the 1-to-1 arrays.  There are very 
few 1-to-many's, and a quick 'sparse' array would have taken 2^16 elements 
of 3 Unicode characters each.  It was deemed 'better' to simply walk the 
list when one of the very rare 1-to-many's is encountered.
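
For anyone who wants to see the lookup spelled out, here is a rough sketch 
of the scheme described above.  It is NOT the code from unicode.c.  The 
names upcase_tbl, multi_up, struct multi_case, init_upcase(), upcase_char() 
and check_examples() are made up for illustration, the table is filled with 
only 'a'-'z' plus one sample 1-to-many entry instead of the real 
UnicodeData.h data, and it assumes that an element of 0 means "no case 
mapping".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;            /* assumption: a 16-bit code unit */

/* Hypothetical stand-in for the 1-to-many records in UnicodeData.h. */
struct multi_case {
    UTF16 from;                    /* source character */
    UTF16 to[3];                   /* replacement sequence, zero padded */
    int len;                       /* number of valid entries in to[] */
};

/* Stand-in for utc2_upcase[]: 0 = no mapping (assumed), 1 = 1-to-many. */
static UTF16 upcase_tbl[0x10000];

/* Illustrative data only: U+00DF (sharp s) upcases to "SS". */
static const struct multi_case multi_up[] = {
    { 0x00DF, { 'S', 'S', 0 }, 2 },
};

/* Fill the direct table.  Static storage starts out zeroed. */
static void init_upcase(void)
{
    size_t i;
    UTF16 c;

    for (c = 'a'; c <= 'z'; c++)
        upcase_tbl[c] = (UTF16)(c - 'a' + 'A');
    for (i = 0; i < sizeof(multi_up) / sizeof(multi_up[0]); i++)
        upcase_tbl[multi_up[i].from] = 1;   /* marker: walk the list */
}

/* Store the upcased form of c in out[] and return how many units it is. */
static int upcase_char(UTF16 c, UTF16 out[3])
{
    UTF16 m = upcase_tbl[c];
    size_t i;
    int k;

    if (m == 0) {                  /* character has no upcase */
        out[0] = c;
        return 1;
    }
    if (m != 1) {                  /* ordinary 1-to-1 mapping */
        out[0] = m;
        return 1;
    }
    /* Rare case: linear walk of the short 1-to-many list. */
    for (i = 0; i < sizeof(multi_up) / sizeof(multi_up[0]); i++) {
        if (multi_up[i].from == c) {
            for (k = 0; k < multi_up[i].len; k++)
                out[k] = multi_up[i].to[k];
            return multi_up[i].len;
        }
    }
    out[0] = c;                    /* marker set but no record found */
    return 1;
}

/* Small usage example covering the three paths. */
static void check_examples(void)
{
    UTF16 out[3];

    init_upcase();
    assert(upcase_char('q', out) == 1 && out[0] == 'Q');
    assert(upcase_char('7', out) == 1 && out[0] == '7');
    assert(upcase_char(0x00DF, out) == 2 && out[0] == 'S' && out[1] == 'S');
}
```

The common path is a single array read, and only the characters marked 
with 1 pay for the list walk.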

Ok, that's the 30s overview of how it 'works'.

Now for the question.  The utc2_up[down]case[] arrays require 256K of heap 
(2 arrays of 0x10000 two-byte elements).  Is 256K of heap a problem that 
should be changed if running in --save-mem mode?
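
The arithmetic behind that figure, as a small sketch.  It assumes UTF16 is 
a 2-byte type; the function names are made up for illustration.  The second 
function shows why a fully indexed 1-to-many table was not attractive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;            /* assumption: 2 bytes per element */

enum { UCS2_ENTRIES = 0x10000 };   /* one slot per 16-bit code unit */

/* Bytes used by the two direct tables (upcase + downcase). */
static size_t case_tables_bytes(void)
{
    assert(sizeof(UTF16) == 2);
    return (size_t)2 * UCS2_ENTRIES * sizeof(UTF16);
}

/* Bytes a fully indexed 1-to-many table would need, per direction:
   2^16 elements of 3 Unicode characters each. */
static size_t indexed_multi_bytes(void)
{
    return (size_t)UCS2_ENTRIES * 3 * sizeof(UTF16);
}
```

That works out to 2 * 65536 * 2 = 262144 bytes (256K) for the existing 
arrays, versus 65536 * 3 * 2 = 393216 bytes (384K) per direction for an 
indexed 1-to-many table.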

IF SO, then there will likely have to be substantial changes to make it 
'work' properly in --save-mem mode, while at the same time trying to 
preserve the speed of the 'normal' mode.  We 'could' fall back to only 
handling 'a'-'z' casing, even in -utf8, if in --save-mem, but that pretty 
much neuters certain formats.
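
To make the trade-off concrete, a minimal sketch of what an 'a'-'z' only 
fallback might look like.  This is hypothetical, not existing john code: 
upcase_ascii_only() and fallback_examples() are invented names.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t UTF16;            /* assumption: a 16-bit code unit */

/* Table-free upcase: only 'a'-'z' is handled, everything else passes
   through unchanged. */
static UTF16 upcase_ascii_only(UTF16 c)
{
    if (c >= 'a' && c <= 'z')
        return (UTF16)(c - 'a' + 'A');
    return c;
}

/* Shows what is lost: U+00E9 (e with acute) should upcase to U+00C9,
   but the fallback leaves it alone. */
static void fallback_examples(void)
{
    assert(upcase_ascii_only('q') == 'Q');
    assert(upcase_ascii_only(0x00E9) == 0x00E9);
}
```

It needs no table memory at all, but any format that depends on upcasing 
non-ASCII characters would get wrong results for such input.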

Solar, what do you think the best direction for this would be?

Jim.
