Remove duplicate rows from a large file in Python

执笔经年 2020-12-15 00:27

I've a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

6 Answers
  •  轮回少年
    2020-12-15 01:14

    If you want a really simple way to do this, just create a sqlite database:

    import csv
    import sqlite3

    conn = sqlite3.connect('single.db')
    cur = conn.cursor()
    cur.execute("""create table if not exists test(
        f1 text,  f2 text,  f3 text,  f4 text,  f5 text,
        f6 text,  f7 text,  f8 text,  f9 text,  f10 text,
        f11 text, f12 text, f13 text, f14 text, f15 text,
        primary key(f1,  f2,  f3,  f4,  f5,  f6,  f7,
                    f8,  f9,  f10, f11, f12, f13, f14, f15))
    """)
    conn.commit()

    # 'input.csv' / 'output.csv' are placeholder paths - use your own files
    with open('input.csv', newline='') as infile:
        reader = csv.reader(infile)
        for row in reader:
            # row is a list of 15 fields; a duplicate row violates the
            # primary key, so sqlite rejects it and we simply skip it
            try:
                cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?,
                               ?, ?, ?, ?, ?, ?, ?, ?)''', row)
            except sqlite3.IntegrityError:
                pass

    conn.commit()

    # write the de-duplicated rows back out
    with open('output.csv', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        cur.execute('select * from test')
        for row in cur:
            writer.writerow(row)
    

    Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You could change the column types stored in the database if you wanted, and since you're already converting the data to strings anyway, you could even use a single text field for the whole line instead of fifteen columns. Plenty of options here.
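
    A minimal sketch of that single-field variant might look like the following (the lines table name and the input.csv / output.csv paths are just placeholders, not anything from the original question):

    import sqlite3

    conn = sqlite3.connect('single.db')
    cur = conn.cursor()
    # one text column holding the entire line; the primary key enforces uniqueness
    cur.execute("create table if not exists lines(line text primary key)")

    with open('input.csv') as infile:   # placeholder path
        for line in infile:
            try:
                cur.execute("insert into lines values (?)", (line.rstrip('\n'),))
            except sqlite3.IntegrityError:
                pass  # exact duplicate line - skip it
    conn.commit()

    with open('output.csv', 'w') as outfile:   # placeholder path
        for (line,) in cur.execute("select line from lines"):
            outfile.write(line + '\n')

    Note that this treats each line as opaque text, so rows that differ only in quoting or trailing whitespace would count as distinct.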
