I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
If you want a really simple way to do this, just create a sqlite database:
import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
cur.execute("""create table test(
    f1 text,
    f2 text,
    f3 text,
    f4 text,
    f5 text,
    f6 text,
    f7 text,
    f8 text,
    f9 text,
    f10 text,
    f11 text,
    f12 text,
    f13 text,
    f14 text,
    f15 text,
    primary key(f1, f2, f3, f4, f5, f6, f7,
                f8, f9, f10, f11, f12, f13, f14, f15))
""")
conn.commit()
# simplified/pseudo code
for row in reader:
    # assuming row is a list-like object
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
    except sqlite3.IntegrityError:
        pass
conn.commit()  # committing once at the end is much faster than once per row

cur.execute('select * from test')
for row in cur:
    # write row to the output csv file
Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You could change the types stored in the database if you wanted, or not, as the case may be. And since you're already converting each row to strings anyway, you could store the whole row in a single field instead of fifteen. Plenty of options here.
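For what it's worth, here's a rough sketch of the single-field variant: each row is joined into one string and stored in a one-column table with a primary key, so sqlite rejects duplicates as you stream through the file. The function name, file paths, and the use of the unit-separator character as a join delimiter are all my own assumptions, not anything from your code:

```python
import csv
import sqlite3

def dedupe_csv(in_path, out_path, db_path='dedupe.db'):
    # The primary key on the single text column makes sqlite
    # reject duplicate rows for us; the db lives on disk, so
    # the CSV never has to fit in memory.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute('create table seen(line text primary key)')
    with open(in_path, newline='') as src, \
         open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            try:
                # join with a delimiter unlikely to appear in the data
                cur.execute('insert into seen values (?)',
                            ('\x1f'.join(row),))
            except sqlite3.IntegrityError:
                continue  # duplicate row - skip it
            writer.writerow(row)  # first occurrence - keep it
    conn.commit()
    conn.close()
```

This writes rows as it reads them, so the output preserves the original order of first occurrences, which the select-everything-back approach doesn't guarantee.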