Question
I have a CSV file that is too big to load into memory. I need to drop duplicated rows from the file, so I tried this:
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])
But if duplicated rows are distributed across different chunks, the script above can't produce the expected result.
Is there a better way?
Answer 1:
You could try something like this.
First, create your chunker.
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
Now create a set of ids:
ids = set()
Now iterate over the chunks:
for chunk in chunker:
    chunk = chunk.drop_duplicates(['Author ID'])
Then, still within the body of the loop, also drop the rows whose IDs are already in the set of known IDs:
chunk = chunk[~chunk['Author ID'].isin(ids)]
Finally, still within the body of the loop, add the new IDs to the set:
ids.update(chunk['Author ID'].values)
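Putting these pieces together, here is a minimal sketch. The output file OUTPUT_PATH and the choice to append each deduplicated chunk to a new CSV are my assumptions, not part of the original answer:

import pandas as pd

ids = set()            # all 'Author ID' values kept so far, across chunks
first_chunk = True     # write the CSV header only once (assumed output format)

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    # drop duplicates within the chunk itself
    chunk = chunk.drop_duplicates(['Author ID'])
    # drop rows whose ID already appeared in an earlier chunk
    chunk = chunk[~chunk['Author ID'].isin(ids)]
    # remember the IDs kept from this chunk
    ids.update(chunk['Author ID'].values)
    # append the surviving rows to the output file (OUTPUT_PATH is assumed)
    chunk.to_csv(OUTPUT_PATH, mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False

Appending each filtered chunk with to_csv keeps memory usage bounded by the chunk size rather than by the size of the whole file; only the set of IDs stays in memory.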
If ids is too large to fit into main memory, you might need to use some disk-based database.
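For instance, here is a rough sketch using SQLite as the on-disk store of seen IDs; the database file name, table name, and column name are illustrative, not from the original answer:

import sqlite3
import pandas as pd

# illustrative on-disk store for the IDs seen so far
conn = sqlite3.connect('seen_ids.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (author_id TEXT PRIMARY KEY)')

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk = chunk.drop_duplicates(['Author ID'])
    keep = []
    for author_id in chunk['Author ID']:
        # keep the row only if this ID has not been recorded yet
        seen = conn.execute('SELECT 1 FROM seen WHERE author_id = ?',
                            (str(author_id),)).fetchone()
        if seen is None:
            conn.execute('INSERT INTO seen (author_id) VALUES (?)',
                         (str(author_id),))
        keep.append(seen is None)
    chunk = chunk[keep]
    # write `chunk` out here, as in the in-memory version above

conn.commit()
conn.close()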
Source: https://stackoverflow.com/questions/39365568/how-to-drop-duplicated-rows-using-pandas-in-a-big-data-file