How to populate columns of a dataframe using a subset of another dataframe?

与世无争的帅哥 提交于 2021-02-07 15:31:57

问题


I have two dataframes like this

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})

  key prop1 prop2
0   A     x     m
1   A     y     n
2   A     z     b
3   B     u     n
4   B     u     b
5   C     y     b
6   C     x     n
7   A     z     n
8   A     z     n
9   C     z     n

  key  prop1  prop2 keep_me
0   A    NaN    NaN   stuff
1   B    NaN    NaN   stuff
2   B    NaN    NaN   stuff
3   C    NaN    NaN   stuff
4   A    NaN    NaN   stuff
5   A    NaN    NaN   stuff

I now want to populate columns prop1 and prop2 in df2 using the values of df1. For each key, we will have more or equal rows in df1 than in df2 (in the example above: 5 times A vs 3 times A, 2 times B vs 2 times B and 3 times C vs 1 time C). For each key, I want to fill df2 using the first n rows per key from df1.

So, my expected outcome for df2 would be:

  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff

As key is not unique, I cannot simple build a dictionary and then use .map.

I was hoping that something along these lines would work:

pd.concat([df2.set_index('key'), df1.set_index('key')], axis=1, join='inner')

but that fails with

ValueError: Shape of passed values is (5, 22), indices imply (5, 10)

as - I guess - the index contains non-unique values.

How can I get my desired output?


回答1:


Because duplicates in key values possible solution is create new counter columns in both DataFrames by GroupBy.cumcount, so possible replace missing values from df2 with align by MultiIndex created by key and g columns with DataFrame.fillna:

df1['g'] = df1.groupby('key').cumcount()
df2['g'] = df2.groupby('key').cumcount()

print (df1)
  key prop1 prop2  g
0   A     x     m  0
1   A     y     n  1
2   A     z     b  2
3   B     u     n  0
4   B     u     b  1
5   C     y     b  0
6   C     x     n  1
7   A     z     n  3
8   A     z     n  4
9   C     z     n  2

print (df2)
  key  prop1  prop2 keep_me  g
0   A    NaN    NaN   stuff  0
1   B    NaN    NaN   stuff  0
2   B    NaN    NaN   stuff  1
3   C    NaN    NaN   stuff  0
4   A    NaN    NaN   stuff  1
5   A    NaN    NaN   stuff  2

df = (df2.set_index(['key','g'])
        .fillna(df1.set_index(['key','g']))
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
  key prop1 prop2 keep_me
0   A     x     m   stuff
1   B     u     n   stuff
2   B     u     b   stuff
3   C     y     b   stuff
4   A     y     n   stuff
5   A     z     b   stuff



回答2:


Another solution to build a dict first from df1 and then pop the elements to fill the NAs in df2

d = df1.groupby(by='key').apply(lambda x: x.values.tolist()).to_dict()
df2[['key','prop1','prop2']] = pd.DataFrame(df2.key.apply(lambda x: d[x].pop(0)).tolist())

    key prop1   prop2   keep_me
0   A   x       m       stuff
1   B   u       n       stuff
2   B   u       b       stuff
3   C   y       b       stuff
4   A   y       n       stuff
5   A   z       b       stuff


来源:https://stackoverflow.com/questions/57787736/how-to-populate-columns-of-a-dataframe-using-a-subset-of-another-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!