How to join two rows that have the same keys and complementary values

Posted by 一曲冷凌霜 on 2020-01-24 21:26:53

Question


My goal is to collapse the table below into a single row; this question deals specifically with the blue row. The table has three categorical variables and six quantitative (analysis) variables. Columns C1 and C2 are the only variables that need to match for a successful join. All blank cells are NaN, and Python code for reproducing the table is below. These rows are exported independently because they carry information found in other related tables that are not included in the export.

Question. (Blue) Some of the quantitative information in the blue row is also in the grey row and some is not. Is there a way to copy the new information (-8 in Q6) into the grey row and then delete/highlight the blue row? Here, the grey row categorical information is maintained, assuming the keep='first' default of drop_duplicates is active.

Related Question. (Yellow) How can I delete rows that are not exact duplicates but contain no new information (only more NaN values)?

Data table


Expected Output

The expected output would have the grey row updated with Q6 from the blue row and the blue row removed.

[['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,-8]]

Current Progress

My current code includes this line to drop all rows where all quantitative variables are NaN.
df.dropna(subset=df.columns[4:],how='all', inplace=True)

It also includes this line to delete all rows whose quantitative variables are all the same.
df.drop_duplicates(subset=df.columns[4:], inplace=True)

Example code that can be copied into an IDE.

import pandas as pd
from numpy import nan
from pandas import Timestamp

# Raw rows; nan marks a blank cell in the table
data = [['S1','P3','H1',Timestamp('2004-12-04 00:00:00'),-15.0,-27.4,nan,-10.0,-15.0,nan],
 ['S1','P3','H1',Timestamp('2004-12-20 00:00:00'),nan,nan,nan,nan,nan,nan],
 ['S1','P3','H2',Timestamp('2004-12-20 00:00:00'),-15.0,nan,nan,-10.0,nan,nan],
 ['S1','P3','H3',Timestamp('2004-12-07 00:00:00'),nan,nan,nan,nan,-15.0,-8.0],
 ['S1','P3','H1', Timestamp('2004-12-04 00:00:00'), -15.0,-27.4,nan,-10.0, -15.0, nan]]
cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
df = pd.DataFrame(data=data, columns=cols)

# Drop exact duplicate rows
df.drop_duplicates(inplace=True)
# Drop rows where every quantitative column is NaN
df.dropna(subset=df.columns[4:], how='all', inplace=True)
# Drop rows whose quantitative columns repeat an earlier row
df.drop_duplicates(subset=df.columns[4:], inplace=True)

Answer 1:


Split off the categorical columns:

df_categorical = df[['C1 (PK)', 'C2 (FK)',"C3", "C4"]]

Group by the first two columns and keep the first element of each group:

df_categorical = df_categorical.groupby(["C1 (PK)", "C2 (FK)"]).first()

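For the quantitative columns, use groupby again, this time with mean. NaN values are skipped, so complementary values are combined; where several rows hold a value for the same column, they are averaged. The mean is restricted to the Q columns because recent pandas versions no longer silently drop non-numeric columns:

df_quantitative = df.groupby(['C1 (PK)', 'C2 (FK)'])[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']].mean()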

Concatenate the two DataFrames to get the result:

df_final = pd.concat([df_quantitative, df_categorical], axis=1)

Finally, reset the index:

df_final.reset_index(inplace=True)
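
The steps above can be assembled into a minimal end-to-end sketch, using the sample data from the question. The names `rows`, `keys`, `quant_cols` and `df_first` are introduced here for illustration only. Two choices in it are assumptions rather than part of the answer: the mean is limited to the Q columns, and the categorical frame is placed first so the column order matches the expected output. The last line shows a possible alternative, relying on the fact that `GroupBy.first()` returns the first non-null value in each column.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (NaN marks a blank cell).
rows = [
    ['S1', 'P3', 'H1', pd.Timestamp('2004-12-04'), -15.0, -27.4, np.nan, -10.0, -15.0, np.nan],
    ['S1', 'P3', 'H1', pd.Timestamp('2004-12-20'), np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    ['S1', 'P3', 'H2', pd.Timestamp('2004-12-20'), -15.0, np.nan, np.nan, -10.0, np.nan, np.nan],
    ['S1', 'P3', 'H3', pd.Timestamp('2004-12-07'), np.nan, np.nan, np.nan, np.nan, -15.0, -8.0],
    ['S1', 'P3', 'H1', pd.Timestamp('2004-12-04'), -15.0, -27.4, np.nan, -10.0, -15.0, np.nan],
]
cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
df = pd.DataFrame(rows, columns=cols)

keys = ['C1 (PK)', 'C2 (FK)']                      # columns that must match
quant_cols = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']  # quantitative columns

# Categorical part: keep the first non-null C3/C4 per key pair.
df_categorical = df.groupby(keys)[['C3', 'C4']].first()

# Quantitative part: mean skips NaN, so complementary values are combined.
# Caveat: values that disagree within a group would be averaged.
df_quantitative = df.groupby(keys)[quant_cols].mean()

# Categorical columns first, then the keys are restored as ordinary columns.
df_final = pd.concat([df_categorical, df_quantitative], axis=1).reset_index()

# Alternative sketch: take the first non-null value of every column per group.
df_first = df.groupby(keys, as_index=False).first()

print(df_final)
```

With `mean`, conflicting values in the same column are averaged; with `first`, the earliest non-null value wins. Which behavior is appropriate depends on whether overlapping values are guaranteed to agree, as they do in the sample data.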


Source: https://stackoverflow.com/questions/59774087/how-to-join-two-rows-that-have-the-same-keys-and-complementary-values
