Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date: Fri, 15 Jul 2011 10:35:16 -0500
From: "JFoug" <>
To: <>
Subject: memory usage question, caused by new Unicode-casing

This question is mostly directed at Solar, but I am posting this to the 

With the new unicode casing, there are 2 somewhat large arrays which are 
placed in the heap in john.  I know we have the --save-memory flag within 
john.  I am not overly familiar with the what/when/why about this flag, but 
I wanted to toss this out.

The current implementation of the unicode casing is this:

These are globals in unicode.c:

UTF16 utc2_upcase[0x10000];
UTF16 utc2_downcase[0x10000];

They are global and not static, due to a couple of macros (inlines) in the 
unicode.h which can be used to access them, anywhere in john code.

Ok, a 30 second overview of how they are used, and how they work:

There is a header file (UnicodeData.h), which contains the data to fill in 
these arrays, and an array of data structures which is used for the 1 to 
many character conversion.  An init function initUnicodeCase() is called 
during john_init(). It properly takes the data from UnicodeData.h, and fills 
out the utc2_up[down]case arrays, building them into properly 'sparse' 
arrays. These arrays hold information about what unicode character each 
specfic one converts to, if you want to up[down]case it (almost all have no 
case).  It may contain information that this character does up[down]case, 
BUT is a multi character.  Then to *case a single character, you simply look 
it up in the utc2_*case[] array.  So, for the letter 'q', you look at 
utc2_upcase[(UTC16)'q'] and you find it upcases to (UTC16)'Q'.  If you look 
up a character, and there is a 1 (binary 1) in the array element, then it 
means this character DOES have an up[down]case, but it requires multiple 
unicode characters, and the conversion is found in the 1-to-many array. In 
this case, that array is walked, until the proper character is found, and 
then the proper 2 or 3 unicode characters making up this 1 character 
up[down] case is known, and can be handled.  The 1-to-many is not sparse, 
like the 1-to-1 array.  There are very few 1-to-many's, and to make a quick 
'sparse' array, would have taken 2^16 elements of 3 Unicode characters each. 
It was deemed 'better' to simply walk the list when one of the very rare 1 
to multi's were observed.

Ok, thats the 30s overview of how it 'works'.

Now for the question.  utc2_up[down]case[] arrays require 256K of  heap.
Is 256K of heap a problem, that should be changed if running in --save-mem 

IF SO, then there will likely have to be substantial changes made to make it 
'work' properly, if in --save-mem mode, while at the same time, trying to 
preserve the speed within the 'normal' mode.   We 'could' fall back to only 
handling 'a'-'z' casing even in -utf8 if in --save-mem, but that pretty much 
nueters certain formats.

Solar, what do you think the best direction for this would be?


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.