Efficient way to aggregate and remove duplicates from very large (password) lists

Context:

I am attempting to combine a large amount of separate password list text files into a single file for use in dictionary based password cracking.
Each text file is line delimited (a single password per line) and there are 82 separate files at the moment. Most (66) files are in the 1-100Mb filesize range, 12 are 100-700Mb, 3 are 2Gb, and 1 (the most problematic) is 11.2Gb.
In total I estimate 1.75 billion non-unique passwords need processing; of these I estimate ~450 million (%25) will be duplicates and ultimately need to be discarded.
I am attempting to do this on a device which has a little over 6Gb of RAM free to play with (i.e. 8Gb with 2Gb already consumed).

Problem:

I need a way to a) aggregate all of these passwords together and b) remove exact duplicates, within my RAM memory constrains and within a reasonable (~7 days, ideally much less but really I don't care if it takes weeks and then I never need to run it again) time window.

I am a competent Python programmer and thus gave it a crack several times already. My most successful attempt used sqlite3 to store processed passwords on the hard disk as it progressed. However this meant that keeping track of which files had already been completed between processing instances (I cancelled and restarted several times to make changes) was tediously achieved by hashing each completed file and maintaining/comparing these each time a new file was opened. For the very large files however, any progress would be lost.

I was processing the text files in blocks of ~1 billion (at most) lines at a time to prevent memory exhaustion without having no feedback for extended periods of time. I know that I could, given a lot of time populate my database fully as I achieved a DB filesize of ~4.5Gb in 24 hours of runtime so I estimate that left to run it would take about 4 days at most to get through everything, but I don't know if/how to most efficiently read/write to it nor do I have any good ideas on how to tackle the removing of duplicates (do it as I am populating the DB or make additional passes afterwards...? Is there a much faster means of doing lookups for uniqueness in a database configuration I don't know about?).

My request here today is for advice / solutions to a programming and optimisation approach on how to achieve my giant, unique password list (ideally with Python). I am totally open to taking a completely different tack if I am off the mark already.

Two nice to haves are:

A way to add more passwords in the future without having to rebuild the whole list; and
A database < 20Gb at the end of all this so that it isn't a huge pain to move around.

Solution

Based on CL's solution which was ultimately a lot more elegant than what I was thinking I came up with a slightly modified method.

Following CL's advice I setup a sqlite3 DB and fed the text files into a Python script which consumed them and then output a command to insert them into the DB. Straight off the bat this ~did~ work but was extremely (infeasibly) slow.

I solved this by a few simple DB optimisations which was much easier to implement and frankly cleaner to just do all from the core Python script included below which builds upon CL's skeleton code. The fact that the original code was generating sooooooo many I/O operations was causing something funny on my (Win7) OS causing BSODs and lost data. I solved this by making the insertion of a whole password file one SQL transaction plus a couple of pragma changes. In the end the code runs at about 30,000 insertions / sec which is not the best but is certainly acceptable for my purposes.

It may be the case that this will still fail on the largest of files but if/when that is the case, I will simply chunk the file down into smaller 1Gb portions and consume them individually.

import sys
import apsw

i = 0
con = apsw.Connection("passwords_test.db")
cur = con.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS Passwords(password TEXT PRIMARY KEY) WITHOUT ROWID;")
cur.execute("PRAGMA journal_mode = MEMORY;")
cur.execute("PRAGMA synchronous = OFF;")

cur.execute("BEGIN TRANSACTION")
for line in sys.stdin:
    escaped = line.rstrip().replace("'", "''")
    cur.execute("INSERT OR IGNORE INTO Passwords VALUES(?);", (escaped,))
    i += 1
    if i % 100000 == 0: # Simple line counter to show how far through a file we are
        print i

cur.execute("COMMIT")
con.close(True)

This code is then run from command line:

insert_passwords.py < passwordfile1.txt

And automated by:

for %%f in (*.txt) do (
insert_passwords.py < %%f
)

All in all, the DB file itself is not growing too quickly, the insertion rate is sufficient, I can break/resume operations at the drop of a hat, duplicate values are being accurately discarded, and the current limiting factor is the lookup speed of the DB not the CPU or disk space.

When storing the passwords in an SQL database, being able to detect duplicates requires an index. This implies that the passwords are stored twice, in the table and in the index.

However, SQLite 3.8.2 or later supports WITHOUT ROWID tables (called "clustered index" or "index-organized tables" in other databases), which avoid the separate index for the primary key.

There is no Python version that already has SQLite 3.8.2 included. If you are not using APSW, you can still use Python to create the SQL commands:

Install the newest sqlite3 command-line shell (download page).

Create a database table:

$ sqlite3 passwords.db
SQLite version 3.8.5 2014-06-02 21:00:34
Enter ".help" for usage hints.
sqlite> CREATE TABLE MyTable(password TEXT PRIMARY KEY) WITHOUT ROWID;
sqlite> .exit

Create a Python script to create the INSERT statements:

import sys
print "BEGIN;"
for line in sys.stdin:
    escaped = line.rstrip().replace("'", "''")
    print "INSERT OR IGNORE INTO MyTable VALUES('%s');" % escaped
print "COMMIT;"

(The INSERT OR IGNORE statement will not insert a row if a duplicate would violate the primary key's unique constraint.)

Insert the passwords by piping the commands into the database shell:

$ python insert_passwords.py < passwords.txt | sqlite3 passwords.db

There is no need to split up input files; fewer transaction have less overhead.

来源：https://stackoverflow.com/questions/24053900/efficient-way-to-aggregate-and-remove-duplicates-from-very-large-password-list

标签

python

sql

sqlite

passwords

large-files