Sort dataframe rows independently by values in another dataframe

Question

Suppose I have two dataframes:

import pandas as pd
import numpy as np

d1 = {}
d2 = {}

np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    # randint replaces the deprecated random_integers(0,100, 12); its upper bound is exclusive
    d2[col+'2'] = np.random.randint(0, 101, 12)

t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")

dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)

I want to sort each row of dat1 by the corresponding row of dat2 and extract a subset of the ordered data from dat1. Below is an example where, for each row, the 5 entries of dat1 with the largest dat2 values are extracted, in ascending order of dat2. For example, with:

                   A         B         C         D         E        F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419 -0.232108 -0.042166  0.42985

            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28  30  16  60  57  77  74

I would get:

            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419  0.429850 -0.042166

Here is my current solution. Its main problem is that it does not handle NA values in either dat1 or dat2, and that needs to be fixed.

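def sortByAnthr(X,Y):
    #return the items of X ordered by the matching values in Y (ascending)
    return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])])

def r_selectr(dat2,dat1, n):
    #order dat1's column labels by each row of dat2 and keep the last n
    #result_type='expand' (pandas >= 0.23) turns the returned lists into columns
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1,result_type='expand').iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign column names

    #look up the dat1 values in that order (.loc replaces the removed .ix)
    ordr_r = ordr_cols.apply(lambda x: dat1.loc[x.name,x.values].tolist(),axis=1,result_type='expand')
    return([ordr_cols, ordr_r])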

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)

ordr_cols.iloc[:2,:]
            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E

ordr_r.iloc[:2,:]
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419  0.429850 -0.042166

For example, with NAs, the above fails to sort correctly:

dat1.iloc[[1,2],[1,3,5]]=np.nan
dat2.iloc[[1,4],[2,4,5]]=np.nan
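
The root cause can be seen without pandas: every ordering comparison against NaN evaluates to False, so sorted() has no consistent place to put a NaN key, and the values around it can end up out of order. A minimal illustration with made-up keys (not the data above):

```python
import math

nan = float("nan")

# Any ordering comparison involving NaN is False, in both directions.
print(nan < 1.0, 1.0 < nan)

# Made-up (label, key) pairs; the key of "b" is missing.
pairs = [("a", 3.0), ("b", nan), ("c", 1.0)]
by_key = sorted(pairs, key=lambda pair: pair[1])
labels = [label for label, _ in by_key]

# Dropping the NaN keys first gives a well-defined order again.
clean = sorted((p for p in pairs if not math.isnan(p[1])), key=lambda pair: pair[1])
clean_labels = [label for label, _ in clean]
print(labels, clean_labels)
```

On CPython the NaN key leaves "a" (key 3.0) ahead of "c" (key 1.0), whereas the filtered sort puts "c" first.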

Answer 1:

Here is my solution. It now handles NAs by intersecting, for each row, the positions of the non-NA values in dat1 and dat2. This, however, introduces an issue with apply, which needs output of the same size for every row. The function fillVacuum pads each result with NaN in place of the items that could not be sorted.
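
As a rough illustration of those two steps on a single row (the values are those of the 2015-02-28 row in the output further down; the variable names exist only in this sketch, and np.argsort stands in for the sorted/zip idiom):

```python
import numpy as np
import pandas as pd

# One row from each frame, typed in by hand for this sketch.
x = pd.Series([np.nan, np.nan, -0.042419, -0.232108, np.nan, 0.429850],
              index=list("ABCDEF"))
y = pd.Series([np.nan, 16, 60, 57, 77, np.nan],
              index=["A2", "B2", "C2", "D2", "E2", "F2"])

# Positions where both rows hold a value; .values avoids aligning on the differing labels.
both = np.flatnonzero(x.notna().values & y.notna().values)

# Sort those positions by the values in y, then map them back to the labels of x.
order = both[np.argsort(y.values[both], kind="stable")]
labels = list(x.index[order])

# Left-pad with NaN so that every row yields a list of the same length.
padded = [np.nan] * (len(x) - len(labels)) + labels
print(padded)
```

Only C and D have a value in both rows, so they are the only labels that get ordered; the remaining four slots are padding.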

def fillVacuum(toFill,MatchLengthOf):
    #left-pad toFill in place with NaN until it is as long as MatchLengthOf
    missing = len(MatchLengthOf)-len(toFill)
    if missing > 0:
        toFill[:0] = [np.nan]*missing

def sortByAnthr(X,Y,Xindex):
    #positions where both X and Y are non-NA (Series.nonzero is no longer available)
    idx = np.flatnonzero(X.notnull().values & Y.notnull().values)

    #order the subset of Xindex by Y
    ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y.iloc[idx]), key=lambda pair: pair[1])]

    #apply needs same-sized output for every row, so pad for the removed positions
    fillVacuum(ordrX, Xindex)

    return(ordrX)

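def OrderRow(row,df):
    #look up the values of df for the sorted, non-NA column labels in row
    ordrd_row = df.loc[row.name,row.dropna().values].tolist()
    #pad to the length of row so that apply gets same-sized output
    fillVacuum(ordrd_row, row)
    return(ordrd_row)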

def r_selectr(dat2,dat1, n):
    #result_type='expand' (pandas >= 0.23) turns the returned lists into columns
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1,result_type='expand').iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign interpretable column names

    ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1,result_type='expand')
    return([ordr_cols, ordr_r])

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)

These functions yield the following:

dat1.iloc[:2,:]
                   A         B         C         D         E         F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.253954
2015-02-28       NaN       NaN -0.042419 -0.232108       NaN  0.429850

dat2.iloc[:2,:]
            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28 NaN  16  60  57  77 NaN

ordr_cols.iloc[:2,:]
              0    1    2  3  4
2015-01-31    A    B    E  D  C
2015-02-28  NaN  NaN  NaN  D  C

ordr_r.iloc[:2,:]
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28       NaN       NaN       NaN -0.232108 -0.042419
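
For larger frames, a row-wise apply can be slow. The following is a sketch of a vectorized alternative, not the method used above: select_top_n is a hypothetical helper name, and it assumes the keys never contain -inf, because missing entries are given that key so that they sort to the left. The two rows are typed in from the output above.

```python
import numpy as np
import pandas as pd

def select_top_n(values_df, keys_df, n):
    """Hypothetical helper: order each row of values_df by the same row of keys_df."""
    vals = values_df.to_numpy(dtype=float)
    keys = keys_df.to_numpy(dtype=float)
    missing = np.isnan(vals) | np.isnan(keys)

    # Missing entries get the lowest possible key so that they sort to the left.
    order = np.argsort(np.where(missing, -np.inf, keys), axis=1, kind="stable")[:, -n:]

    # Gather labels and values in sorted order, then blank out the missing slots.
    picked_missing = np.take_along_axis(missing, order, axis=1)
    labels = values_df.columns.to_numpy(dtype=object)[order]
    labels[picked_missing] = np.nan
    picked_vals = np.take_along_axis(vals, order, axis=1)
    picked_vals[picked_missing] = np.nan
    return (pd.DataFrame(labels, index=values_df.index),
            pd.DataFrame(picked_vals, index=values_df.index))

idx = pd.to_datetime(["2015-01-31", "2015-02-28"])
small1 = pd.DataFrame(
    [[0.441227, -0.817548, -0.723062, -0.205149, 0.230843, -0.253954],
     [np.nan, np.nan, -0.042419, -0.232108, np.nan, 0.429850]],
    index=idx, columns=list("ABCDEF"))
small2 = pd.DataFrame(
    [[47, 47, 82, 66, 64, 40],
     [np.nan, 16, 60, 57, 77, np.nan]],
    index=idx, columns=["A2", "B2", "C2", "D2", "E2", "F2"])

top_cols, top_vals = select_top_n(small1, small2, 5)
print(top_cols)
print(top_vals)
```

The stable sort keeps tied keys (A2 and B2 are both 47 in the first row) in their original column order, which matches what sorted does in the functions above.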

Source: https://stackoverflow.com/questions/36411724/sort-dataframe-rows-independently-by-values-in-another-dataframe

Tags

python

sorting

pandas

indexing