Sort dataframe rows independently by values in another dataframe

Submitted by 情到浓时终转凉″ on 2019-12-12 03:28:49

Question


Suppose two dataframes:

import pandas as pd
import numpy as np

d1 = {}
d2 = {}

np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.randint(0, 101, 12)  # replaces deprecated random_integers(0, 100, 12); randint's upper bound is exclusive

t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")

dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)

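I want to order each row of dat1 by the values in the corresponding row of dat2, then extract a subset of the reordered dat1 values. In the example below, the columns of each row are ordered by ascending dat2 value, and the five columns with the largest dat2 values are kept. For example, with: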

                   A         B         C         D         E        F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419 -0.232108 -0.042166  0.42985

            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28  30  16  60  57  77  74

I would get:

            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419  0.429850 -0.042166

Here is my solution. Its main problem is that it does not handle NA values in either dat1 or dat2, and that needs to be fixed.

def sortByAnthr(X,Y):
    return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])])

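def r_selectr(dat2,dat1, n):
    # for each row, order dat1's column labels by dat2's values and keep the last n
    # (result_type='expand' makes apply return a DataFrame rather than a Series of lists)
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1,result_type='expand').iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign column names

    # look up the dat1 values for the ordered column labels (.loc replaces the removed .ix)
    ordr_r = ordr_cols.apply(lambda x: dat1.loc[x.name,x.values].tolist(),axis=1,result_type='expand')
    return([ordr_cols, ordr_r])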

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)

ordr_cols.iloc[:2,:]
            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E

ordr_r.iloc[:2,:]
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419  0.429850 -0.042166

For example, with NAs, the above fails to sort correctly:

dat1.iloc[[1,2],[1,3,5]]=np.nan
dat2.iloc[[1,4],[2,4,5]]=np.nan
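
A likely reason for the failure (an assumption, not something stated in the question) is that Python's sorted() relies on "<", and every ordering comparison with NaN is False. The minimal sketch below uses made-up labels and keys to contrast that with NumPy's argsort, which places NaN last:

```python
import numpy as np

nan = float("nan")

# Every ordering comparison involving NaN evaluates to False
comparisons = [nan < 1.0, nan > 1.0, nan == nan]

# sorted() relies on "<", so a NaN key can leave the other items out of order
keys = [3.0, nan, 1.0]          # illustrative values only
labels = ["A", "B", "C"]
by_sorted = [lab for lab, _ in sorted(zip(labels, keys), key=lambda pair: pair[1])]

# NumPy's sort places NaN last, so the non-NaN keys stay correctly ordered
by_argsort = [labels[i] for i in np.argsort(keys, kind="stable")]

print(comparisons)
print(by_sorted, by_argsort)
```

Because of this, NaN entries have to be removed or masked before sorting rather than passed through the key function.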

Answer 1:


Here is my solution. It now handles NAs by intersecting, for each row, the positions of the non-NA values in dat1 and dat2. This introduces a complication: apply needs output of the same length for every row. The function fillVacuum pads the positions that could not be sorted with NaN.

def fillVacuum(toFill,MatchLengthOf):
    #left-pad toFill in place with NaN until it is as long as MatchLengthOf
    n_missing = len(MatchLengthOf)-len(toFill)
    if n_missing>0:
        toFill[:0] = [np.nan]*n_missing

def sortByAnthr(X,Y,Xindex):
    #intersect the positions of non-NA values in the two rows
    #(np.flatnonzero replaces Series.nonzero, which was removed from pandas)
    idx = np.intersect1d(np.flatnonzero(X.notnull().values),np.flatnonzero(Y.notnull().values))

    #order the subset of X.index by Y (positional lookup, since Y's labels differ from X's)
    ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y.iloc[idx]), key=lambda pair: pair[1])]

    #apply needs equal-length output per row, so pad the dropped positions with NaN
    fillVacuum(ordrX, Xindex)

    return(ordrX)

def OrderRow(row,df):
    #look up the df values for the non-NA column labels in this row (.loc replaces the removed .ix)
    ordrd_row = df.loc[row.name,row.dropna().values].tolist()
    fillVacuum(ordrd_row, row)
    return(ordrd_row)

def r_selectr(dat2,dat1, n):
    #result_type='expand' makes apply return a DataFrame rather than a Series of lists
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1,result_type='expand').iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign interpretable column names

    ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1,result_type='expand')
    return([ordr_cols, ordr_r])

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
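
To make the intersection and padding steps concrete, here is a small sketch on a single pair of rows. The Series x and y and their values are hypothetical, chosen only for illustration, and the padding is written inline rather than through fillVacuum:

```python
import numpy as np
import pandas as pd

# Hypothetical single row from each frame (illustrative values)
x = pd.Series([np.nan, np.nan, -0.04, -0.23, np.nan, 0.43], index=list("ABCDEF"))
y = pd.Series([np.nan, 16, 60, 57, 77, np.nan], index=[c + "2" for c in "ABCDEF"])

# Positions that are non-NA in both rows
idx = np.intersect1d(np.flatnonzero(x.notna().to_numpy()),
                     np.flatnonzero(y.notna().to_numpy()))

# Order the surviving labels of x by the matching values of y
ordered = [label for label, _ in sorted(zip(x.index[idx], y.iloc[idx]),
                                        key=lambda pair: pair[1])]

# Left-pad with NaN so every row yields a list of the same length
padded = [np.nan] * (len(x) - len(ordered)) + ordered
print(idx, ordered, padded)
```

Only columns C and D are non-NA in both rows, so they are the only labels that get ordered; the remaining four slots are padding.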

These functions yield the following:

dat1.iloc[:2,:]
                   A         B         C         D         E         F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.253954
2015-02-28       NaN       NaN -0.042419 -0.232108       NaN  0.429850

dat2.iloc[:2,:]
            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28 NaN  16  60  57  77 NaN

ordr_cols.iloc[:2,:]
              0    1    2  3  4
2015-01-31    A    B    E  D  C
2015-02-28  NaN  NaN  NaN  D  C

ordr_r.iloc[:2,:]
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28       NaN       NaN       NaN -0.232108 -0.042419
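
For larger frames, the row-wise apply calls can be slow. The following is a vectorized sketch of the same idea, not the answer's own method: select_top_n is a hypothetical helper name, and it assumes both frames have the same shape, that column positions correspond, and that the keys contain no genuine -inf values (invalid entries are masked with -inf so they sort first).

```python
import numpy as np
import pandas as pd

def select_top_n(values, keys, n):
    """Hypothetical helper: per row, order the columns of `values` by `keys`
    (ascending) and keep the n columns with the largest keys. Positions where
    either frame is NaN are treated as missing and come out as NaN padding."""
    v = values.to_numpy(dtype=float)
    k = keys.to_numpy(dtype=float)
    invalid = np.isnan(v) | np.isnan(k)

    # Mask invalid entries with -inf so an ascending sort pushes them to the front
    k_masked = np.where(invalid, -np.inf, k)
    # Stable sort keeps tied keys in column order, like Python's sorted()
    order = np.argsort(k_masked, axis=1, kind="stable")[:, -n:]

    picked_invalid = np.take_along_axis(invalid, order, axis=1)
    cols = values.columns.to_numpy(dtype=object)[order]
    cols[picked_invalid] = np.nan
    vals = np.take_along_axis(v, order, axis=1)
    vals[picked_invalid] = np.nan
    return (pd.DataFrame(cols, index=values.index),
            pd.DataFrame(vals, index=values.index))

# Usage with the two rows shown in the output above
idx = pd.to_datetime(["2015-01-31", "2015-02-28"])
d1 = pd.DataFrame([[0.441227, -0.817548, -0.723062, -0.205149, 0.230843, -0.253954],
                   [np.nan, np.nan, -0.042419, -0.232108, np.nan, 0.429850]],
                  index=idx, columns=list("ABCDEF"))
d2 = pd.DataFrame([[47, 47, 82, 66, 64, 40],
                   [np.nan, 16, 60, 57, 77, np.nan]],
                  index=idx, columns=[c + "2" for c in "ABCDEF"])
top_cols, top_vals = select_top_n(d1, d2, 5)
print(top_cols)
print(top_vals)
```

The stable sort is a deliberate choice: it keeps tied keys (A2 and B2 are both 47 in the first row) in their original column order, which is also what sorted() does in the functions above.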


Source: https://stackoverflow.com/questions/36411724/sort-dataframe-rows-independently-by-values-in-another-dataframe
