How to map two rows of different dataframes based on a condition in pandas

岁酱吖の submitted on 2019-12-11 07:57:43

Question


I have two dataframes,

df1,

 Names
 one two three
 Sri is a good player
 Ravi is a mentor
 Kumar is a cricketer player

df2,

 values
 sri
 NaN
 sri, is
 kumar,cricketer player

I am trying to get, for each row of df2, the row in df1 that contains all the items in that df2 row.

My expected output is,

 values                  Names
 sri                     Sri is a good player
 NaN
 sri, is                 Sri is a good player
 kumar,cricketer player  Kumar is a cricketer player

I tried:

df1["Names"].str.contains("|".join(df2["values"].values.tolist()))

I also tried,

but I cannot get my expected output because the values contain commas (","). Please help.


Answer 1:


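Use set logic with NumPy broadcasting: turn each string into a set of lowercase words, then test every df1 set against every df2 set with the superset operator (>=).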

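import numpy as np
import pandas as pd

# Turn each string into a set of lowercase words (NaN becomes {''})
d1 = df1['Names'].fillna('').str.lower().str.split('[^a-z]+').apply(set).values
d2 = df2['values'].fillna('').str.lower().str.split('[^a-z]+').apply(set).values

# Entry [i, j] is True when the df1 set j is a superset of the df2 set i
i, j = np.where(d1 >= d2[:, None])

# Align each matched name with the df2 row it belongs to
df2.assign(Names=pd.Series(df1['Names'].values[j], df2['values'].index[i]))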

                   values                        Names
0                     sri         Sri is a good player
1                     NaN                          NaN
2                 sri, is         Sri is a good player
3  kumar,cricketer player  Kumar is a cricketer player
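
For anyone who wants to run the snippet above end to end, here is a self-contained sketch. The two dataframes are rebuilt by hand from the question, and the helper name tokens and the variable superset are made up for this illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (assumed to be plain string columns)
df1 = pd.DataFrame({'Names': ['one two three',
                              'Sri is a good player',
                              'Ravi is a mentor',
                              'Kumar is a cricketer player']})
df2 = pd.DataFrame({'values': ['sri', np.nan, 'sri, is',
                               'kumar,cricketer player']})

def tokens(series):
    """Hypothetical helper: lowercase, split on non-letters, return an object array of sets."""
    return series.fillna('').str.lower().str.split('[^a-z]+').apply(set).values

d1 = tokens(df1['Names'])
d2 = tokens(df2['values'])

# superset[i, j] is True when df1 row j contains every word of df2 row i
superset = d1 >= d2[:, None]
i, j = np.where(superset)

# Rows of df2 without a match (here the NaN row) are left as NaN
result = df2.assign(Names=pd.Series(df1['Names'].values[j], index=df2.index[i]))
print(result)
```

Note that this relies on each df2 row matching at most one df1 row. If several df1 rows matched the same df2 row, the Series would carry a duplicated index and the assignment would need extra handling, for example keeping only the first match.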



Answer 2:


Try this:

import pandas as pd

df1 = pd.read_csv('sample.csv')
df2 = pd.read_csv('sample_2.csv')

df2['values']= df2['values'].str.lower()
df1['names']= df1['names'].str.lower()

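# Replace punctuation with spaces, then collapse repeated whitespace.
# regex=True is required because recent pandas versions treat the pattern as a literal by default.
df2["values"] = df2['values'].str.replace(r'[^\w\s]', ' ', regex=True)
df2['values']= df2['values'].replace(r'\s+', ' ', regex=True)

df1["names"] = df1['names'].str.replace(r'[^\w\s]', ' ', regex=True)
df1['names']= df1['names'].replace(r'\s+', ' ', regex=True)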

df2['list_values'] = df2['values'].apply(lambda x: str(x).split())
df1['list_names'] = df1['names'].apply(lambda x: str(x).split())

list_names = df1['list_names'].tolist()

def check_names(x, list_names):
    output = ''
    for list_name in list_names:
        if set(list_name) >= set(x):
            output = ' '.join(list_name)
            break
    return output

df2['Names'] = df2['list_values'].apply(lambda x: check_names(x, list_names))
print(df2)

Output

                   values                        Names
0                     sri         sri is a good player
1                     NaN                             
2                  sri is         sri is a good player
3  kumar cricketer player  kumar is a cricketer player

Explanation

It's essentially a fuzzy matching problem, so both columns are normalized before they are compared. Here are the steps that I applied:

  1. Lowercase everything for standardized matching.
  2. Replace punctuation with spaces and collapse repeated whitespace in both dataframes.
  3. Split each string into a list of words.
  4. Finally, do the matching via the check_names() function, which returns the first row of df1 whose words include every word of the df2 row (or an empty string if there is none).
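
The steps can also be tried without any CSV files. The following is only a sketch under a few assumptions: the dataframes are typed in by hand from the question, the helpers to_words and first_match are hypothetical names introduced here, and, unlike the code above, it returns the name in its original casing and handles NaN explicitly instead of going through the string 'nan'.

```python
import re
import numpy as np
import pandas as pd

# Sample data copied from the question (stand-ins for sample.csv and sample_2.csv)
df1 = pd.DataFrame({'names': ['one two three',
                              'Sri is a good player',
                              'Ravi is a mentor',
                              'Kumar is a cricketer player']})
df2 = pd.DataFrame({'values': ['sri', np.nan, 'sri, is',
                               'kumar,cricketer player']})

def to_words(text):
    """Hypothetical helper: lowercase, turn punctuation into spaces, split into words."""
    if pd.isna(text):
        return []  # a missing value produces no words
    return re.sub(r'[^\w\s]', ' ', str(text).lower()).split()

def first_match(words, candidates):
    """Hypothetical helper: return the first name whose words cover all of `words`."""
    if not words:
        return ''  # nothing to look for, so report no match
    for original, candidate_words in candidates:
        if set(candidate_words) >= set(words):
            return original
    return ''

# Pre-compute the word list of every df1 row once
candidates = [(name, to_words(name)) for name in df1['names']]

df2['Names'] = df2['values'].apply(lambda v: first_match(to_words(v), candidates))
print(df2)
```

Keeping the original string next to its word list avoids having to rebuild the name with ' '.join(...), which is why the matched names keep their capitalization here.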


Source: https://stackoverflow.com/questions/49022851/how-to-map-two-rows-of-different-dataframe-based-on-a-condition-in-pandas
