Record linking two large CSVs in Python?

给你一囗甜甜゛ 提交于 2021-02-10 15:40:41

问题


I'm somewhat new to Pandas and Python Record Linkage Toolkit, so please forgive me if the answer is obvious. I'm trying to cross-reference one large dataset, "CSV_1", against another, "CSV_2", in order to create a third CSV consisting only of matches that concatenates all columns from CSV_1 and CSV_2 regardless of overlap in order to preserve the original record, e.g.

CSV_1                               CSV_2
Name     City     Date              Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11              THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00
Nomatch  Cton     10/10             huh, inc.         Lton, AMERICA      9/8/2020 00:00

Would output

CSV_3
Name     City     Date    Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11    THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00

The data is not well structured, and CSV_2 has many more columns than CSV_1, which is why I have been attempting to find fuzzy matches based on the name column with the city column as an index block. Having trouble getting the matching stage to even execute, never mind efficiently, and haven't even tackled the concatenation step. Any help on how to tackle this?

Edit: The files are each very large (both ~1M lines with 8-20 columns, 80-200mb), even loading single columns with pandas is troublesome. For context, this is a data project for a job application which indicated a preference for a 'passing familiarity with Python or R'. Under normal circumstances this title requires no coding knowledge whatsoever, which is why I found it so strange the company decided to assign this complex data problem. Parameters are: Single Python file running locally in a lower-mem (think 2013 Dell Inspiron) environment without modification (i.e. no increasing page file size).


回答1:


For your problem statement and considering the size of the data involved, I recommend loading your data into a database. Then, I would use the following SQL to solve your problem, then I would read the result into my local python env / pandas dataframe:

select *
from csv_1
inner join csv_2
on csv_1.city = csv_2.city_of_origin
where STRPOS( lower(csv_1.name) , lower(csv_2.name_of_thing) )>0
or STRPOS( lower(csv_2.name_of_thing) , lower(csv_1.name) )>0


来源:https://stackoverflow.com/questions/63401778/record-linking-two-large-csvs-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!