Efficient way to aggregate and remove duplicates from very large (password) lists

Posted by  ̄綄美尐妖づ on 2019-12-04 20:56:41

When storing the passwords in an SQL database, being able to detect duplicates requires an index. This implies that the passwords are stored twice, in the table and in the index.

However, SQLite 3.8.2 or later supports WITHOUT ROWID tables (called "clustered indexes" or "index-organized tables" in other databases), which avoid the separate index for the primary key.
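
The storage difference can be observed directly. The following is a rough sketch, assuming the SQLite library linked into your Python's sqlite3 module is 3.8.2 or later (otherwise the WITHOUT ROWID clause is rejected as a syntax error). The helper name `page_count`, the table name `t`, and the synthetic rows are illustrative only; they are not taken from the question.

```python
import os
import sqlite3
import tempfile

def page_count(create_sql, rows):
    """Create a one-table database file, insert rows, return its size in pages."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
        try:
            conn.execute(create_sql)
            with conn:  # one transaction for all inserts
                conn.executemany("INSERT OR IGNORE INTO t VALUES (?)", rows)
            return conn.execute("PRAGMA page_count").fetchone()[0]
        finally:
            conn.close()

# Synthetic stand-ins for passwords (hypothetical data, not a real list).
rows = [("password-%06d" % i,) for i in range(20000)]

with_rowid = page_count("CREATE TABLE t(password TEXT PRIMARY KEY)", rows)
without_rowid = page_count(
    "CREATE TABLE t(password TEXT PRIMARY KEY) WITHOUT ROWID", rows)

print("rowid table + index:", with_rowid, "pages")
print("WITHOUT ROWID table:", without_rowid, "pages")
```

The exact page counts depend on the page size and the data, so only the relative difference is meaningful.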

Python's bundled sqlite3 module may be linked against an SQLite library older than 3.8.2; sqlite3.sqlite_version reports the version in use. If it is too old and you are not using APSW, you can still use Python to create the SQL commands:

  1. Install the newest sqlite3 command-line shell (download page).
  2. Create a database table:

    $ sqlite3 passwords.db
    SQLite version 3.8.5 2014-06-02 21:00:34
    Enter ".help" for usage hints.
    sqlite> CREATE TABLE MyTable(password TEXT PRIMARY KEY) WITHOUT ROWID;
    sqlite> .exit
    
  3. Create a Python script to create the INSERT statements:

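    import sys

    # Wrap all inserts in one transaction to avoid per-statement commit overhead.
    print("BEGIN;")
    for line in sys.stdin:
        # Drop only the line ending; double single quotes to escape them in SQL.
        escaped = line.rstrip("\r\n").replace("'", "''")
        print("INSERT OR IGNORE INTO MyTable VALUES('%s');" % escaped)
    print("COMMIT;")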
    

    (The INSERT OR IGNORE statement will not insert a row if a duplicate would violate the primary key's unique constraint.)

  4. Insert the passwords by piping the commands into the database shell:

    $ python insert_passwords.py < passwords.txt | sqlite3 passwords.db
    

There is no need to split up the input files; fewer transactions mean less overhead.
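
If the SQLite library behind your Python's sqlite3 module is already 3.8.2 or later, the same approach can be run in-process instead of through the shell. This is a minimal sketch under that assumption, not the method from the steps above: the function name `load_passwords` and the sample lines are hypothetical, and parameter binding (`?`) replaces the manual quote escaping.

```python
import sqlite3

def load_passwords(conn, lines):
    """Insert each line as a password, skipping duplicates; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS MyTable(password TEXT PRIMARY KEY) WITHOUT ROWID")
    with conn:  # a single transaction around all inserts
        conn.executemany(
            "INSERT OR IGNORE INTO MyTable VALUES (?)",
            ((line.rstrip("\r\n"),) for line in lines))
    return conn.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]

# WITHOUT ROWID needs SQLite 3.8.2 or later; stop early if the library is older.
if sqlite3.sqlite_version_info < (3, 8, 2):
    raise SystemExit("SQLite %s lacks WITHOUT ROWID" % sqlite3.sqlite_version)

conn = sqlite3.connect(":memory:")  # use a file path such as "passwords.db" for real data
count = load_passwords(conn, ["hunter2\n", "letmein\n", "hunter2\n", "it's\n"])
print(count)  # 3 unique passwords
```

For a real list, pass an open file object as `lines`; the generator feeds rows to executemany one at a time, so the whole file is never held in memory.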
