Date: Thu, 8 Sep 2011 16:13:33 -0500 From: "jfoug" <jfoug@....net> To: <john-dev@...ts.openwall.com> Subject: Question over validity of concept in JtR's benchmarking code. I have been re-working the pkzip format. I have it running quite a bit faster, with a huge improvement in the 1 byte checksum type zips (which ARE the most common, outside of the unix infoZip built zip files). On my test machine, I have it running well, in real world tests. However, benchmark testing was REALLY off, WAY below what I was seeing in real world (7250k real world, vs . 45k benchmark) Here is the reasoning why, and then a description of what the pkzip format is doing under the hood. The reason the timings are SO far off, is that in the benchmark code, we are always providing the correct password to the format in the inner testing loop. Here is the question I have from the title of this message: Is sending correct passwords to the benchmark functions a valid way to benchmark? In real testing, we assume that a correct password, is a VERY rare event. There may be billions, or 100s of billions of passwords tested, before a correct password is found (if ever). Would it not make more sense, to provide incorrect passwords when benchmarking the format? NOTE, the description below is the current (WIP) pkzip_fmt logic. It is very long winded. Skip it if you are not interested. It was included to add a real world example to the above question. Now, how pkzip is setup. Note, an incorrect password, is several orders of magnitude quicker within the crypt function than the correct one (the actual amount of difference will depend upon the size of the smallest encrypted zip blob that was in the zip file). In the pkzip format, testing for correct password, requires a full decryption of the encrypted zip blob of data, and then inflating that compressed zip blob, and then performing a CRC32 against the decompressed data, to make sure it checks out. If the inflation works, and gives the proper sized file blob, and the CRC32 of the data gives the correct data, then we have a valid password crack. This is SLOW, and can be orders of magnitude slower than even MsCash2, if the zip blob is large. Thus, to make this format work faster, I have used multiple levels of testing. I first take the knowledge that the last byte (or 2) of the 12 byte IV, is a known value. This is used by unzip programs to not have to query for a password multiple times, and processing multiple encrypted files in the same zip (some unzip programs even remember the password between files and sessions). Also it is a quick check within the unzip, so it does not have to decrypt megabytes (or more) of data, and inflate it, (which can take a good long while), only to then be able to inform the user that they have entered the wrong password. Now, just because a password causes the decryption of the IV to have the last byte (or 2) to match the crc value, does not mean this is the right password, but it is good enough for the unzipper to continue forward. I have taken this knowledge as my first level of checking. What happens, is I have to convert the password into the key, then run the 12 bytes of IF through the decrypt and key update functions. Then the last byte (or 2) is checked. This one quick check alone, will remove 255/256 bogus passwords (or 65535/65536 for 2 byte checksum). I then perform what I am calling a level 2 checking. In this checking, I look at the very first inflate code, and act accordingly. This code is 2 bits (the 2nd and 3rd bit of the very first byte of the decrypted zip stream). If the code is a 3 (both bits set), then I know this cannot be the password, since that is not a valid inflate code (only 0, 1 and 2 are valid). This simple 1 byte check will remove 25% of anything getting past the initial level 1 checksum testing, while only requiring 1 additional byte of decryption. If the zip code is a 0 (a store), then I can decrypt 4 more bytes, (which is a 2 byte length, with a 2 byte checksum), and make sure the checksum matches. For the type 0, I also only allow it to proceed, if the first byte is 0 or 1. Since the deflate/inflate algorithm will discard the remaining 5 bits of that first byte, I am making an assumption that all zippers would have initially zerod out that byte. I strongly believe that is a valid assumption. Now, if the code was a 0, only about 1 out of (2^16 * 2^5) will pass. Thus, we have pretty much effectively remove another ¼ of the possible bad passwords. I also have code which handles the code type 1 (fixed Huffman table), and code type 2, (variable Huffman table). The variable Huffman checking only requires 10 bytes to be decrypted (in worse case). The fixed Huffman code, I have to decrypt more bytes, and actually perform the inflate algorithm (using the fixed tables), until I either arrive at a bad code, a bad reverse lookup (looking too far back in the data stream), or I run out of data. I am currently testing whether 24 bytes, or 36 bytes is the right amount. It is somewhere in that range. So, for level 2, I perform some data validation within a single inflate code block. I do not allocate any memory, or do any of the overhead which a real inflate handler would do. The level 2 is checking is a new thing being done with the new pkzip format I will be releasing soon (still have a nit or 2 to get worked out). There used to be (and still is, but is #if commented out) code that would perform a small ASCII test, and also was code added which would search the start of files for magic values, if the hash string told the pkzip_fmt to do this. The logic on what magic to use was put into zip2john. So a .zip file inside of a .zip file would have a specific 4 byte magic signature to match. A .gif file would have 2 signatures (GIF87a or GIF89a). NOTE, this code has been commented out of the current format, due to level 2 AND level 3 checking, AND due to the magic signature could be wrong. Remember the file IS encrypted, so just because it is called a .DOC, or .BMP, or .GIF, does not mean that the file really IS that type, and if we guess wrong, and compare the starting decrypted/inflated bytes against what we thought it was supposed to be, but in actuality it is NOT, then we will never be able to crack the password. So this code is still there, but simply removed from the build right now. Level 3 checking, is simply to decrypt a little more of the file (currently this is 140 bytes, and is ONLY done, if the zip blob is larger than 200 bytes). This decrypted data is then sent to zlib. If there is any error at all (any return other than Z_OK), then we know the zip data was bogus, and we have the wrong password. Note, that most of the bogus passwords which passed the checksum validation (level 1), have already been removed in level 2 (which does much of the same error checking as in the real zlib inflate function). However we are passing a larger buffer to inflate, and it finds ALMOST all invalid items. Finally, a level 4 test is performed. This is the real test. The format decrypts the entire zip blob, and uses zlib to inflate this properly, and the inflated data is CRC32 checked. If the data decrypts with no errors, its length is correct, and the data CRC32s to the proper value, then we are assured of having the correct password. At this time, ALL 4 levels of checking are being done within the crypt function. The prior version of pkzip_fmt (the one that has been released on the wiki), only perform level 1 in the crypt() function, and perform the ascii/magic testing, and then the level 4 testing, within the cmp_all() function. However, doing this, we lose multi-threading. The current crypt() does all work incrementally. It continues to decrypt more and more bytes, the deeper into the function we go, and then continues to use the zlib buffer deeper and deeper, AS LONG AS this password continues to appear to be the right one. Thus, when bench gives only the right passwords, it causes the format to always have to perform the worst case scenario (CPU cycle wise and memory allocation wise), which is a password was found. NOTE, it will be very likely that the level 4 (the full .zip checking), WILL be pulled out from the crypt function, and put into the cmp_all function, for several technical reasons. However, since that is supposed to be a VERY RARE event, this should make no impact at all on the speed of the format. Sorry for the long winded email. I wanted to post some information on the howto/why the format is built the way it is, and also post about the bench sending correct passwords (and this email killed both birds with one stone). Jim. Content of type "text/html" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.