How to drop duplicated rows using pandas in a big data file?

Posted by 北战南征 on 2019-12-01 20:45:34

Question


I have a CSV file that is too big to load into memory. I need to drop the duplicated rows in the file, so I tried this:

import pandas as pd

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk = chunk.drop_duplicates(subset=['Author ID'])

But if duplicate rows are spread across different chunks, the script above doesn't produce the expected result.

Is there any better way?


Answer 1:


You could try something like this.

First, create your chunker.

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

Now create a set of ids:

ids = set()

Now iterate over the chunks:

for chunk in chunker:
    chunk = chunk.drop_duplicates(subset=['Author ID'])

Now, still within the body of the loop, also drop the rows whose IDs are already in the set of known ids:

    chunk = chunk[~chunk['Author ID'].isin(ids)]

Finally, still within the body of the loop, add the new ids to the set:

    ids.update(chunk['Author ID'].values)
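
Putting those pieces together, a minimal sketch of the whole loop might look like this. The input path 'authors.csv' and the output file name 'authors_dedup.csv' are placeholder assumptions for illustration; the original answer doesn't say what should happen to the deduplicated rows, so this sketch simply appends them to a new CSV.

import pandas as pd

AUTHORS_PATH = 'authors.csv'   # hypothetical input path

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

ids = set()      # Author IDs seen in earlier chunks
header = True    # write the CSV header only for the first chunk

for chunk in chunker:
    # Drop duplicates within the current chunk.
    chunk = chunk.drop_duplicates(subset=['Author ID'])
    # Drop rows whose ID already appeared in an earlier chunk.
    chunk = chunk[~chunk['Author ID'].isin(ids)]
    # Remember the IDs that survived this chunk.
    ids.update(chunk['Author ID'].values)
    # Append the deduplicated rows to the output file.
    chunk.to_csv('authors_dedup.csv', mode='a', header=header, index=False)
    header = False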

If ids is too large to fit into main memory, you might need to use some disk-based database.
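
If it does come to that, here is a rough sketch of how the same loop could keep the seen IDs in an SQLite file instead of a Python set. The choice of SQLite, the file names (seen_ids.db, authors.csv, authors_dedup.csv), and the seen/staging table layout are illustrative assumptions, not part of the original answer.

import sqlite3
import pandas as pd

AUTHORS_PATH = 'authors.csv'          # hypothetical input path

# SQLite file that persists the IDs seen so far (illustrative name).
conn = sqlite3.connect('seen_ids.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (author_id TEXT PRIMARY KEY)')

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

header = True
for chunk in chunker:
    # Drop duplicates within the current chunk.
    chunk = chunk.drop_duplicates(subset=['Author ID'])

    # Stage this chunk's IDs (as strings) in a scratch table so that
    # membership can be checked with a join instead of an in-memory set.
    staged = chunk[['Author ID']].astype(str).rename(
        columns={'Author ID': 'author_id'})
    staged.to_sql('staging', conn, if_exists='replace', index=False)

    # IDs in this chunk that were not seen in any earlier chunk.
    new_ids = pd.read_sql_query(
        'SELECT s.author_id FROM staging AS s '
        'LEFT JOIN seen AS k ON k.author_id = s.author_id '
        'WHERE k.author_id IS NULL', conn)['author_id']

    # Keep only the unseen rows, then record their IDs as seen.
    chunk = chunk[chunk['Author ID'].astype(str).isin(new_ids)]
    conn.execute('INSERT OR IGNORE INTO seen SELECT author_id FROM staging')
    conn.commit()

    # Append the surviving rows to the deduplicated output file.
    chunk.to_csv('authors_dedup.csv', mode='a', header=header, index=False)
    header = False

conn.close()

The staging table lets each chunk's IDs be compared against everything seen so far with a single join, so only one chunk's worth of IDs is ever held in memory at a time.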



Source: https://stackoverflow.com/questions/39365568/how-to-drop-duplicated-rows-using-pandas-in-a-big-data-file
