Pandas fuzzy detect duplicates

Submitted by 醉酒当歌 on 2020-03-17 16:54:28

Question


How can I use fuzzy matching in pandas to detect duplicate rows efficiently?

How can I find the duplicates of one row among all the others without a gigantic for loop that converts row_i to a string and then compares it to every other row?


Answer 1:


This is not pandas-specific, but within the Python ecosystem the dedupe library seems to do what you want. In particular, it lets you compare each column of a row separately and then combine the results into a single probability score for a match.
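To make the "compare each column separately, then combine" idea concrete, here is a minimal sketch that uses only the standard library's difflib. It is not the dedupe API: dedupe learns how to weight the fields from labelled examples, whereas the `weights` mapping, the `threshold` value, and the function names below are hypothetical choices made for illustration. The combined score is a plain weighted average, not a calibrated probability.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a, b):
    """Similarity in [0, 1] between two values, compared as lowercase strings."""
    return SequenceMatcher(None, str(a).lower().strip(), str(b).lower().strip()).ratio()


def find_fuzzy_duplicates(records, weights, threshold=0.85):
    """Return (i, j, score) for every pair of records scoring at least `threshold`.

    records: list of dicts, one per row
    weights: hypothetical hand-picked weights, {column_name: weight}
    """
    total = sum(weights.values())
    matches = []
    # Compare every pair of rows once; each column is scored on its own.
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        score = sum(
            w * field_similarity(left[col], right[col]) for col, w in weights.items()
        ) / total
        if score >= threshold:
            matches.append((i, j, round(score, 3)))
    return matches


# Made-up sample rows, for illustration only.
records = [
    {"name": "John Smith", "city": "New York"},
    {"name": "Jon Smith", "city": "New York"},
    {"name": "Alice Brown", "city": "Boston"},
]
duplicates = find_fuzzy_duplicates(records, weights={"name": 2.0, "city": 1.0})
```

With a DataFrame, `df.to_dict("records")` produces the list of dicts this function expects. The sketch compares all pairs, so its cost grows quadratically with the number of rows. For large tables you would first group rows into candidate blocks (for example by a shared postcode or first letter) and compare only within each block.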




Answer 2:


There is now a package that makes it easier to use the dedupe library with pandas: pandas-dedupe.

(I am a developer of the original dedupe library, but not of the pandas-dedupe package.)



Source: https://stackoverflow.com/questions/39490190/pandas-fuzzy-detect-duplicates
