Record linking two large CSVs in Python?

问题

I'm somewhat new to Pandas and Python Record Linkage Toolkit, so please forgive me if the answer is obvious. I'm trying to cross-reference one large dataset, "CSV_1", against another, "CSV_2", in order to create a third CSV consisting only of matches that concatenates all columns from CSV_1 and CSV_2 regardless of overlap in order to preserve the original record, e.g.

CSV_1                               CSV_2
Name     City     Date              Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11              THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00
Nomatch  Cton     10/10             huh, inc.         Lton, AMERICA      9/8/2020 00:00

Would output

CSV_3
Name     City     Date    Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11    THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00

The data is not well structured, and CSV_2 has many more columns than CSV_1, which is why I have been attempting to find fuzzy matches based on the name column with the city column as an index block. Having trouble getting the matching stage to even execute, never mind efficiently, and haven't even tackled the concatenation step. Any help on how to tackle this?

Edit: The files are each very large (both ~1M lines with 8-20 columns, 80-200mb), even loading single columns with pandas is troublesome. For context, this is a data project for a job application which indicated a preference for a 'passing familiarity with Python or R'. Under normal circumstances this title requires no coding knowledge whatsoever, which is why I found it so strange the company decided to assign this complex data problem. Parameters are: Single Python file running locally in a lower-mem (think 2013 Dell Inspiron) environment without modification (i.e. no increasing page file size).

回答1:

For your problem statement and considering the size of the data involved, I recommend loading your data into a database. Then, I would use the following SQL to solve your problem, then I would read the result into my local python env / pandas dataframe:

select *
from csv_1
inner join csv_2
on csv_1.city = csv_2.city_of_origin
where STRPOS( lower(csv_1.name) , lower(csv_2.name_of_thing) )>0
or STRPOS( lower(csv_2.name_of_thing) , lower(csv_1.name) )>0

来源：https://stackoverflow.com/questions/63401778/record-linking-two-large-csvs-in-python

标签

python

pandas

record-linkage