I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
If you want a really simple way to do this, just create a sqlite database:
import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
cur.execute("""create table test(
    f1 text,
    f2 text,
    f3 text,
    f4 text,
    f5 text,
    f6 text,
    f7 text,
    f8 text,
    f9 text,
    f10 text,
    f11 text,
    f12 text,
    f13 text,
    f14 text,
    f15 text,
    primary key(f1, f2, f3, f4, f5, f6, f7,
                f8, f9, f10, f11, f12, f13, f14, f15))
""")
conn.commit()
# simplified/pseudo code
for row in reader:
    # assuming row is a list-like object
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
    except sqlite3.IntegrityError:
        pass
conn.commit()  # committing once at the end is much faster than once per row

cur.execute('select * from test')
for row in cur:
    # write row to the output csv file
Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You could change the types stored in the database if you wanted, or not, as the case may be. And since you're already converting each row to strings anyway, you could store the whole row in a single field instead of fifteen. Plenty of options here.
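For what it's worth, here's a rough sketch of the single-field variant: each row is joined into one string and stored in a one-column table with a primary key, so sqlite rejects duplicates as you stream through the file. The function name, file paths, and the use of the unit-separator character as a join delimiter are all my own assumptions, not anything from your code:

```python
import csv
import sqlite3

def dedupe_csv(in_path, out_path, db_path='dedupe.db'):
    # The primary key on the single text column makes sqlite
    # reject duplicate rows for us; the db lives on disk, so
    # the CSV never has to fit in memory.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute('create table seen(line text primary key)')
    with open(in_path, newline='') as src, \
         open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            try:
                # join with a delimiter unlikely to appear in the data
                cur.execute('insert into seen values (?)',
                            ('\x1f'.join(row),))
            except sqlite3.IntegrityError:
                continue  # duplicate row - skip it
            writer.writerow(row)  # first occurrence - keep it
    conn.commit()
    conn.close()
```

This writes rows as it reads them, so the output preserves the original order of first occurrences, which the select-everything-back approach doesn't guarantee.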