Pandas - Merge one dataframe with itself only partially

Submitted by 会有一股神秘感。 on 2020-03-04 23:16:06

Question


This is a follow-up to the following question: Pandas Similarity Matching

The ultimate goal of the first question was to find a way to similarity-match each row against every other row that has the same CountryId.

Here is the sample dataframe:

import pandas as pd
df = pd.DataFrame([[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'], [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'], [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'], [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'], [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'], [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'], [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']], columns=['PartnerId', 'CountryId', 'Name'])

The answer in the other thread worked for that question, but I ran into computational problems. My real source contains more than 19,000 rows and will grow even bigger in the future.

The answer suggested merging the dataframe with itself, so that each row is compared with every other row that has the same CountryId:

df = df.merge(df, on='CountryId', how='outer')  

Even for the small example of 15 rows provided above, we end up with 225 merged rows. For the whole dataset I ended up with 131,044,638 rows, which exhausted my RAM. Therefore I need a better way to merge the two dataframes.
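The size of such a self-merge can be estimated before running it: every CountryId group of n rows is paired with itself, so it contributes n * n rows. The following is a minimal sketch of that arithmetic; the helper name self_merge_row_count and the reduced sample frame are made up for illustration and are not part of the original question.

```python
import pandas as pd

def self_merge_row_count(frame: pd.DataFrame, key: str) -> int:
    """Rows that frame.merge(frame, on=key) would produce, without running the merge."""
    group_sizes = frame.groupby(key).size()
    # Each group of n rows is paired with itself, giving n * n rows.
    return int((group_sizes ** 2).sum())

# Reduced stand-in for the sample data: 15 partners, all in CountryId 5.
sample = pd.DataFrame({
    'PartnerId': range(1, 16),
    'CountryId': [5] * 15,
})
merge_rows = self_merge_row_count(sample, 'CountryId')
print(merge_rows)  # 225 for a single group of 15 rows
```

Because the count grows quadratically with the group size, a few large countries can dominate the total.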

As I'm doing a similarity check, I was wondering if it is possible to:

  1. Sort the dataframe based on the CountryId and the Name

  2. Merge each row only with the 3 rows before and the 3 rows after it. E.g. after sorting, row 1 will only be merged with rows 2, 3 and 4, as it is the first row; row 2 will only be merged with rows 1, 3, 4 and 5, and so on.

This way, similar names will sit almost next to each other, and names "further away" will not be similar anyway. Therefore it's not necessary to check their similarity.
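To make the window rule concrete, here is a small sketch that lists which sorted positions a row would be paired with inside one CountryId group. The function name neighbour_positions and the 1-based positions are assumptions chosen to match the wording of the example above, not code from the thread.

```python
def neighbour_positions(position: int, n_rows: int, window: int = 3) -> list:
    """1-based positions of the rows within `window` of `position`, excluding the row itself."""
    first = max(1, position - window)
    last = min(n_rows, position + window)
    return [p for p in range(first, last + 1) if p != position]

print(neighbour_positions(1, 15))  # [2, 3, 4]
print(neighbour_positions(2, 15))  # [1, 3, 4, 5]

# Total number of (row, neighbour) pairs for the 15-row sample.
total_pairs = sum(len(neighbour_positions(p, 15)) for p in range(1, 16))
print(total_pairs)  # 78
```

With a fixed window the number of pairs grows roughly linearly with the number of rows (at most 6 per row here), instead of quadratically as in the full self-merge.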


Answer 1:


I found a workaround for my problem: for each row, take the 3 rows before it (if they exist) and the 3 rows after it.

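# Sort so that similar names within a country end up next to each other.
sorted_df = df.sort_values(by=['CountryId', 'Name']).reset_index(drop=True)
ids = sorted_df['PartnerId'].astype(str)

# Neighbour window; named so the built-ins min() and max() are not shadowed.
min_shift = -3
max_shift = 3

# Join the PartnerIds of the neighbouring rows into one string per row.
# 'A' is a placeholder for a neighbour that does not exist.
new_sorted = None
for s in range(min_shift, max_shift + 1):
    if s == 0:
        continue  # a row is not matched with itself
    shifted = ids.shift(s, fill_value='A')
    new_sorted = shifted if new_sorted is None else new_sorted + '-' + shifted
new_sorted = new_sorted.rename('MatchingID')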


match = sorted_df.merge(new_sorted,left_index=True,right_index=True)

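# Expand every row into one row per neighbouring PartnerId, skipping the 'A' placeholders.
# Note that matching_df is a list of rows, not a DataFrame.
matching_df = []
for index, row in match.iterrows():
    row_values = row.tolist()
    matching_df += [row_values[0:-1] + [int(w)] for w in row_values[-1].split('-') if w != 'A']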

If anyone can come up with a better idea I would be glad to hear it!
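One possible alternative, offered only as a sketch: shift the PartnerId column inside a groupby on CountryId, once per offset, and concatenate the results. As far as I can tell, the shift in the workaround above runs over the whole sorted frame, so rows at the edge of one CountryId could be paired with rows of the next one; shifting per group avoids that. The function name windowed_pairs and the small demo frame are made up for illustration.

```python
import pandas as pd

def windowed_pairs(frame: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Pair each row with up to `window` rows before and after it within the same CountryId."""
    ordered = frame.sort_values(['CountryId', 'Name']).reset_index(drop=True)
    grouped_ids = ordered.groupby('CountryId')['PartnerId']
    parts = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # a row is not paired with itself
        part = ordered.copy()
        # shift() inside a groupby never reaches into another CountryId group;
        # missing neighbours become NaN and are dropped.
        part['MatchingID'] = grouped_ids.shift(offset)
        parts.append(part.dropna(subset=['MatchingID']))
    pairs = pd.concat(parts, ignore_index=True)
    # NaN forced the column to float; convert back to integer ids.
    pairs['MatchingID'] = pairs['MatchingID'].astype(int)
    return pairs

# Hypothetical demo data with two countries.
demo = pd.DataFrame(
    [[1, 5, 'BIT'], [2, 5, 'APR'], [3, 5, 'AIK'], [4, 7, 'ACCMNO'], [5, 7, 'ANNOSXY']],
    columns=['PartnerId', 'CountryId', 'Name'],
)
pairs = windowed_pairs(demo, window=1)
print(pairs[['PartnerId', 'CountryId', 'MatchingID']])
```

The result can be merged back against the original frame on MatchingID to fetch the neighbour's Name for the similarity check. The round trip through float assumes the ids are small enough to be represented exactly, which holds for ids like those in the sample.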



Source: https://stackoverflow.com/questions/59784767/pandas-merge-one-dataframe-with-itself-only-partially
