Question
Goal: for each row i of df2, if its name is a substring of (or an exact match for) a name in some row N of df1, and the State and District values of df1 row N match the State and District values of df2 row i, combine the two rows.
I was advised to use difflib to create an artificial key column to merge on.
This new column is called 'Name'. For each candidate name in df1, difflib.get_close_matches looks for similar strings in df2['Name'].
This works well when every row in the 'CandidateName' column has a value, but I get IndexError: list index out of range when a cell is empty.
I tried to resolve this by filling the empty cells with the string 'EMPTY'. However, the same error still occurs.
# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples
import pandas as pd
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
#               CandidateName  District     Party  State
# 0       Theodorick A. Bland         9               VA
# 1  Aedanus Rutherford Burke         2               SC
# 2               Jason Lewis         2               MN
# 3          Barbara Comstock        10  Democrat     VA
# 4          Theodorick Bland         9               VA
# 5             Aedanus Burke         2               SC
# 6       Jason Initial Lewis         2  Democrat     MN
# 7                        ''         1      Whig     NH
# 8                        ''         1      Whig     NH
# Data Frame 2
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
#                Name  District     Party  State
# 0  Theodorick Bland         9               VA
# 1     Aedanus Burke         2               SC
# 2       Jason Lewis         2  Democrat     MN
# 3  Barbara Comstock        10  Democrat     VA
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
Expected result:
print(df1)
#               CandidateName State  District     Party              Name
# 0       Theodorick A. Bland    VA         9            Theodorick Bland
# 1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
# 2               Jason Lewis    MN         2                 Jason Lewis
# 3          Barbara Comstock    VA        10  Democrat  Barbara Comstock
# 4          Theodorick Bland    VA         9            Theodorick Bland
# 5             Aedanus Burke    SC         2               Aedanus Burke
# 6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
# 7                              NH         1      Whig
# 8                              NH         1      Whig
Actual Error Result:
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
IndexError: list index out of range
Answer 1:
difflib.get_close_matches returns a list, and when no string is similar enough (as happens for an empty string, and for 'EMPTY' as well) that list is empty. An empty list has no index 0, which is why you get this error. Secondly, the key column has to hold strings rather than lists for the merge to work, so we join each list into a string as follows:
Note: you don't need the df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY') step.
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')
print(df_merge)
              CandidateName State  District     Party              Name
0       Theodorick A. Bland    VA         9            Theodorick Bland
1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
2               Jason Lewis    MN         2                 Jason Lewis
3          Barbara Comstock    VA        10  Democrat  Barbara Comstock
4          Theodorick Bland    VA         9            Theodorick Bland
5             Aedanus Burke    SC         2               Aedanus Burke
6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
7                              NH         1      Whig
8                              NH         1      Whig
Note that I added the how='left' argument to the merge so that every row of df1 is kept, including the rows that have no match in df2. This preserves the shape of your original dataframe.
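If you would rather keep only the single best match than join the whole list, a small helper can guard the [0] lookup. This is only a sketch under stated assumptions: first_match and its default parameter are hypothetical names, not part of difflib, and the candidate list is copied from df2['Name'] in the question.

```python
import difflib

# Candidate names, copied from df2['Name'] in the question.
names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']


def first_match(value, candidates, default=''):
    """Return the closest candidate, or `default` when nothing is close enough.

    `first_match` is a hypothetical helper, not a difflib function.
    """
    matches = difflib.get_close_matches(value, candidates, n=1)
    # get_close_matches returns [] when no candidate reaches the cutoff
    # (0.6 by default), which is what made the original [0] lookup fail.
    return matches[0] if matches else default


print(first_match('Theodorick A. Bland', names))
print(repr(first_match('', names)))
print(repr(first_match('EMPTY', names)))
```

With pandas, the helper could presumably be applied as df1['CandidateName'].apply(lambda x: first_match(x, df2['Name'])), assuming every cell in that column is a string.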
Explanation of ''.join()
We do this to convert the list to a string. The example below uses a space as the separator so the effect is easier to see:
lst = ['hello', 'world']
print(' '.join(lst))
# hello world
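Two edge cases of ''.join() matter for this question, sketched below using difflib alone. The name 'Jason Lewis Jr.' is a made-up candidate that does not appear in the question's data; it is there only to produce more than one close match.

```python
import difflib

# An empty list joins to an empty string, so no IndexError is raised.
no_match = difflib.get_close_matches('', ['Jason Lewis'])
print(repr(''.join(no_match)))

# Hypothetical candidates: 'Jason Lewis Jr.' is not in the question's data.
candidates = ['Jason Lewis', 'Jason Lewis Jr.']
matches = difflib.get_close_matches('Jason Lewis', candidates)

# With several close matches, ''.join() glues them all together, and the
# resulting key would not equal any single candidate name.
joined = ''.join(matches)
print(matches)
print(repr(joined))
```

So the join is safe for cells with no match, but it is worth checking for concatenated keys if your real data contains several similar names.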
Source: https://stackoverflow.com/questions/55445922/how-can-i-create-an-artificial-key-column-for-merging-two-datasets-using-difflab