问题
Suppose two dataframes:
import pandas as pd
import numpy as np
d1 = {}
d2 = {}
np.random.seed(5)
for col in list("ABCDEF"):
d1[col] = np.random.randn(12)
d2[col+'2'] = np.random.random_integers(0,100, 12)
t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")
dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)
I want to sort dat1's rows by the rows in dat2 and extract a subset of the ordered data from dat1. Below, is an example where the top 5 values per row are extracted from dat1. For example, with:
A B C D E F
2015-01-31 0.441227 -0.817548 -0.723062 -0.205149 0.230843 -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419 -0.232108 -0.042166 0.42985
A2 B2 C2 D2 E2 F2
2015-01-31 47 47 82 66 64 40
2015-02-28 30 16 60 57 77 74
I would get:
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 A D C F E
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419 0.429850 -0.042166
Here is my solution. The biggest issue is that this code does not deal with NA values either in dat1 or dat2 which is an enormous issue that needs to be fixed.
def sortByAnthr(X,Y):
return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])])
def r_selectr(dat2,dat1, n):
ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1).iloc[:,-n:]
ordr_cols.columns = list(range(0,n)) #assign column names
ordr_r = ordr_cols.apply(lambda x: dat1.ix[x.name,x.values].tolist(),axis=1)
return([ordr_cols, ordr_r])
ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
ordr_cols.iloc[:2,:]
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 A D C F E
ordr_r.iloc[:2,:]
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419 0.429850 -0.042166
For example, with NAs, the above fails to sort correctly:
dat1.iloc[[1,2],[1,3,5]]=np.nan
dat2.iloc[[1,4],[2,4,5]]=np.nan
回答1:
Here is my solution. It now handles NAs by intersecting the indexes of non-NA values in dat1 and dat2 for each row. This, however, introduces an issue in apply, whereby apply needs same-sized output for each row. The function that fills items that cannot/were not sorted is fillVacuum.
def fillVacuum(toFill,MatchLengthOf):
if len(toFill)<len(MatchLengthOf):
[toFill.insert(i, np.nan) for i in range(len(MatchLengthOf)-len(toFill))]
return()
def sortByAnthr(X,Y,Xindex):
#intersect non-na column indexes between two datasets
idx = np.intersect1d(X.notnull().nonzero()[0],Y.notnull().nonzero()[0])
#order the subset of X.index by Y
ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y[idx]), key=lambda pair: pair[1])]
#due to molding that'll happen later in apply, it is necessary to fill removed indexes
fillVacuum(ordrX, Xindex)
return(ordrX)
def OrderRow(row,df):
ordrd_row = df.ix[row.dropna().name,row.dropna().values].tolist()
fillVacuum(ordrd_row, row)
return(ordrd_row)
def r_selectr(dat2,dat1, n):
ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1).iloc[:,-n:]
ordr_cols.columns = list(range(0,n)) #assign interpretable column names
ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1)
return([ordr_cols, ordr_r])
ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
These functions yield the following:
dat1.iloc[:2,:]
A B C D E F
2015-01-31 0.441227 -0.817548 -0.723062 -0.205149 0.230843 -0.253954
2015-02-28 NaN NaN -0.042419 -0.232108 NaN 0.429850
dat2.iloc[:2,:]
A2 B2 C2 D2 E2 F2
2015-01-31 47 47 82 66 64 40
2015-02-28 NaN 16 60 57 77 NaN
ordr_cols.iloc[:2,:]
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 NaN NaN NaN D C
ordr_r.iloc[:2,:]
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 NaN NaN NaN -0.232108 -0.042419
来源:https://stackoverflow.com/questions/36411724/sort-dataframe-rows-independently-by-values-in-another-dataframe