How to drop duplicated rows using pandas in a big data file?

試著忘記壹切 提交于 2019-12-01 22:15:33

You could try something like this.

First, create your chunker.

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

Now create a set of ids:

ids = set()

Now iterate over the chunks:

for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])

However, now, within the body of the loop, drop also ids already in the set of known ids:

    chunk = chunk[~chunk['Author ID'].isin(ids)]

Finally, still within the body of the loop, add the new ids

    ids.update(chunk['Author ID'].values)

If ids is too large to fit into main memory, you might need to use some disk-based database.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!