Date: Fri, 15 Jul 2011 10:35:16 -0500 From: "JFoug" <jfoug@....net> To: <john-dev@...ts.openwall.com> Subject: memory usage question, caused by new Unicode-casing This question is mostly directed at Solar, but I am posting this to the list. With the new unicode casing, there are 2 somewhat large arrays which are placed in the heap in john. I know we have the --save-memory flag within john. I am not overly familiar with the what/when/why about this flag, but I wanted to toss this out. The current implementation of the unicode casing is this: These are globals in unicode.c: UTF16 utc2_upcase[0x10000]; UTF16 utc2_downcase[0x10000]; They are global and not static, due to a couple of macros (inlines) in the unicode.h which can be used to access them, anywhere in john code. Ok, a 30 second overview of how they are used, and how they work: There is a header file (UnicodeData.h), which contains the data to fill in these arrays, and an array of data structures which is used for the 1 to many character conversion. An init function initUnicodeCase() is called during john_init(). It properly takes the data from UnicodeData.h, and fills out the utc2_up[down]case arrays, building them into properly 'sparse' arrays. These arrays hold information about what unicode character each specfic one converts to, if you want to up[down]case it (almost all have no case). It may contain information that this character does up[down]case, BUT is a multi character. Then to *case a single character, you simply look it up in the utc2_*case array. So, for the letter 'q', you look at utc2_upcase[(UTC16)'q'] and you find it upcases to (UTC16)'Q'. If you look up a character, and there is a 1 (binary 1) in the array element, then it means this character DOES have an up[down]case, but it requires multiple unicode characters, and the conversion is found in the 1-to-many array. In this case, that array is walked, until the proper character is found, and then the proper 2 or 3 unicode characters making up this 1 character up[down] case is known, and can be handled. The 1-to-many is not sparse, like the 1-to-1 array. There are very few 1-to-many's, and to make a quick 'sparse' array, would have taken 2^16 elements of 3 Unicode characters each. It was deemed 'better' to simply walk the list when one of the very rare 1 to multi's were observed. Ok, thats the 30s overview of how it 'works'. Now for the question. utc2_up[down]case arrays require 256K of heap. Is 256K of heap a problem, that should be changed if running in --save-mem mode? IF SO, then there will likely have to be substantial changes made to make it 'work' properly, if in --save-mem mode, while at the same time, trying to preserve the speed within the 'normal' mode. We 'could' fall back to only handling 'a'-'z' casing even in -utf8 if in --save-mem, but that pretty much nueters certain formats. Solar, what do you think the best direction for this would be? Jim.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.