Date: Wed, 12 Sep 2018 13:14:22 +0200
From: "JohnyKrekan" <krekan@...nykrekan.com>
To: <john-users@...ts.openwall.com>
Subject: Re: good program for sorting large wordlists

Thanks for the info. After I raised the memory sizes and the space for temp files, the sort went well. I was sorting the superwpa wordlist to find out how many duplicates it contains when character case is ignored. The original file size was approx. 10.7 GB; after sorting it was 7.05 GB, so about 4 GB was taken up by the same words with modified character case. If you could decide: would you rather use this smaller wordlist and set the case-changing rules in the program used to test those hashes, or use the original wordlist, which contains lots of the same words with modified casing?

Johny Krekan

----- Original Message -----
From: "Solar Designer" <solar@...nwall.com>
To: <john-users@...ts.openwall.com>
Sent: Tuesday, September 11, 2018 5:42 PM
Subject: Re: [john-users] good program for sorting large wordlists

> Hi,
>
> On Tue, Sep 11, 2018 at 05:19:18PM +0200, JohnyKrekan wrote:
>> Hello, I would like to ask whether someone has experience with good tool
>> to sort large text files with possibilities such as gnu sort. I am using
>> it to sort wordlists but when I tried to sort 11 gb wordlist, it crashed
>> while writing final output file after writing around 7 gb of data and
>> did not delete some temp files. When I was sorting smaller (2gb) wordlist
>> it took me just about 15 minutes while this 11 gb took 4.5 hours (Intel
>> core I 7 2.6ghz, 12 gb ram, ssd drives).
>
> Most importantly, usually you do not need to "sort" - you just need to
> eliminate duplicates.
> In fact, in many cases you'd prefer to eliminate
> duplicates without sorting, in case your input list is sorted roughly
> for non-increasing estimated probability of hitting a real password -
> e.g., if it's produced by concatenating common/leaked password lists
> first with other general wordlists next, or/and by pre-applying wordlist
> rules (which their authors generally order such that better performing
> rules come first).
>
> You can eliminate duplicates without sorting using JtR's bundled
> "unique" program. In jumbo and running on a 64-bit platform, it will by
> default use a memory buffer of 2 GB (the maximum it can use). It does
> not use any temporary files (instead, it reads back the output file
> multiple times if needed). You can use it e.g. like this:
>
> ./unique output.lst < input.lst
>
> or:
>
> cat ~/wordlists/* | ./unique output.lst
>
> or:
>
> cat ~/wordlists/common/* ~/wordlists/uncommon/* | ./unique output.lst
>
> or:
>
> ./john -w=password.lst --rules=jumbo --stdout | ./unique output.lst
>
> As to sorting, recent GNU sort from the coreutils package works well.
> You'll want to use the "-S" option to let it use more RAM, and fewer
> temporary files, e.g. "-S 5G". You can also use e.g. "--parallel=8".
>
> As to it running out of space for the temporary files, perhaps you have
> your /tmp on tmpfs, so in RAM+swap, and this might be too limiting. If
> so, you may use the "-T" option, e.g. "-T /home/user/tmp", to let it use
> your SSDs instead. Combine this with e.g. "-S 5G" to also use your RAM.
>
> As to "it crashed while writing final output file after writing around 7
> gb of data", did you possibly put the output file in /tmp as well? Just
> don't do that.
>
> I hope this helps.
>
> Alexander
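[Editor's note: the case-insensitive duplicate count discussed above can be measured with standard coreutils alone. The sketch below folds case with tr and deduplicates with sort -u, then compares line counts; the inline demo file is a hypothetical stand-in for the real superwpa list, and for a file of that size you would add the "-S 5G" and "-T /path/to/tmp" options mentioned in the reply.]

```shell
# Sketch: count how many lines of a wordlist are case-insensitive duplicates.
# "wordlist.lst" is a tiny demo input standing in for the real file.
printf 'Password\npassword\nPASSWORD\nhello\n' > wordlist.lst

total=$(wc -l < wordlist.lst)
# Fold everything to lower case, then keep one copy of each distinct line.
unique=$(tr '[:upper:]' '[:lower:]' < wordlist.lst | sort -u | wc -l)
echo "duplicates (case-insensitive): $((total - unique))"
# prints: duplicates (case-insensitive): 2
```

Note that this counts duplicates without preserving the original casing of the survivors; to keep one original-case spelling per word instead, sort -f -u (case-insensitive comparison) is an alternative, at the cost of which variant survives being unspecified.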