Fastest way to perform complex search on pandas dataframe

广开言路 2020-12-13 21:10

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
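
A minimal sketch for rebuilding the "before" frame, assuming the values shown in the first answer's printout; the construction itself is an assumption, not the asker's code:

```python
import pandas as pd

# Values copied from the input frame printed in the first answer.
df = pd.DataFrame({
    "flightTo":   ["ABC", "DEF", "AAA", "BBB"],
    "flightFrom": ["DEF", "XYZ", "BBB", "CCC"],
    "toNum":      [123, 456, 473, 917],
    "fromNum":    [456, 893, 917, 341],
    "toCode":     [8000, 9999, 5555, 5555],
    "fromCode":   [8000, 9999, 5555, 5555],
})
print(df)
```

Rows 0 and 1 share the stop DEF, and rows 2 and 3 share BBB; the goal is to collapse each such pair into a single row.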

2 Answers
  • 2020-12-13 21:35

    Here's a NumPy solution, which may be useful when performance matters:

    import numpy as np

    def remove_middle_dest(df):
        x = df.to_numpy()
        # obtain a flat numpy array from both columns
        b = x[:,0:2].ravel()
        _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
        # Index of duplicate values in b
        ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
        # Indices to be used to replace the content in the columns
        replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
        # Column (0 or 1) opposite the duplicate, i.e. the endpoint
        # to keep from the row that will be dropped
        col = (ixs_drop % 2) ^ 1
        # Column indices to read from each dropped row: the kept
        # endpoint plus fromNum (3) and fromCode (5)
        keep_cols = np.broadcast_to([3,5],(len(col),2))
        ixs = np.concatenate([col[:,None], keep_cols], 1)
        # translate indices to row indices
        rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
        c = np.empty((len(col), 5), dtype=x.dtype)
        c[:,::2] = x[rows_drop[:,None], ixs]
        c[:,1::2] = x[rows_replace[:,None], [2,4]]
        # update the dataframe in place, then drop the merged rows
        # (by position, so a non-default index also works)
        df.iloc[rows_replace, 1:] = c
        return df.drop(df.index[rows_drop])
    

    For the proposed dataframe, this yields the expected output:

    print(df)
        flightTo flightFrom  toNum  fromNum  toCode  fromCode
    0      ABC        DEF    123      456    8000      8000
    1      DEF        XYZ    456      893    9999      9999
    2      AAA        BBB    473      917    5555      5555
    3      BBB        CCC    917      341    5555      5555
    
    remove_middle_dest(df)
    
        flightTo flightFrom  toNum  fromNum  toCode  fromCode
    0      ABC        XYZ    123      893    8000      9999
    2      AAA        CCC    473      341    5555      5555
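
To see how the duplicate-detection step behaves, here is a minimal sketch that runs only the `np.unique` part of the function on the flattened `flightTo`/`flightFrom` values of the frame above. The array `b` is typed in by hand for illustration:

```python
import numpy as np

# Flattened (flightTo, flightFrom) pairs, row by row, from the example frame.
b = np.array(["ABC", "DEF", "DEF", "XYZ", "AAA", "BBB", "BBB", "CCC"])

# ix:  position of the first occurrence of each unique value
# inv: for every element of b, the index of its unique value
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)

# Positions that are not a first occurrence are the repeated stops.
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)

# For each repeated stop, find the position of its first occurrence.
replace_at = (inv[:, None] == inv[ixs_drop]).argmax(0)

# Integer division by 2 maps flat positions back to row numbers.
rows_drop, rows_replace = ixs_drop // 2, replace_at // 2

print(ixs_drop.tolist(), replace_at.tolist())    # [2, 6] [1, 5]
print(rows_drop.tolist(), rows_replace.tolist())  # [1, 3] [0, 2]
```

Rows 1 and 3 are therefore merged into rows 0 and 2, which matches the output above.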
    

    This approach makes no assumption about which rows hold the duplicate, nor about which of the two columns it appears in (this covers the edge case described in the question). For instance, with the following dataframe:

        flightTo flightFrom  toNum  fromNum  toCode  fromCode
    0      ABC        DEF    123      456    8000      8000
    1      XYZ        DEF    893      456    9999      9999
    2      AAA        BBB    473      917    5555      5555
    3      BBB        CCC    917      341    5555      5555
    
    remove_middle_dest(df)
    
        flightTo flightFrom  toNum  fromNum  toCode  fromCode
    0      ABC        XYZ    123      456    8000      9999
    2      AAA        CCC    473      341    5555      5555
    
  • 2020-12-13 21:37

    This is a network problem, so we can use networkx. Note that this also handles routes with more than two stops, for example NY-DC-WA-NC.

    import networkx as nx

    # build the graph from the dataframe, one edge per row
    G = nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

    # list the connected components: airports linked to each other,
    # directly or through intermediate stops
    l = list(nx.connected_components(G))

    # give every airport in a component the same id
    L = [dict.fromkeys(y, x) for x, y in enumerate(l)]

    # flatten the per-component dicts into one airport -> component id map
    d = {k: v for d in L for k, v in d.items()}

    # aggregation map for groupby: 'first' for the to* columns,
    # 'last' for the from* columns
    grouppd = dict(zip(df.columns.tolist(), ['first', 'last']*3))
    # group rows by component id and aggregate to get the output
    df.groupby(df.flightTo.map(d)).agg(grouppd)
    
    Out[22]: 
             flightTo flightFrom  toNum  fromNum  toCode  fromCode
    flightTo                                                      
    0             ABC        XYZ    123      893    8000      9999
    1             AAA        CCC    473      341    5555      5555
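
The component-labelling step can also be sketched without networkx. The block below swaps in a small union-find (not what the answer uses) to build the same kind of airport-to-component map as `d`; `label_components` and the NY-DC-WA-NC edges are hypothetical names and data added for illustration:

```python
def label_components(edges):
    """Map every node to a component id using a simple union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the components in order of first appearance.
    ids = {}
    return {node: ids.setdefault(find(node), len(ids)) for node in parent}


# (flightTo, flightFrom) pairs from the example, plus a hypothetical
# multi-stop chain NY-DC-WA-NC to show that longer routes collapse too.
edges = [("ABC", "DEF"), ("DEF", "XYZ"), ("AAA", "BBB"), ("BBB", "CCC"),
         ("NY", "DC"), ("DC", "WA"), ("WA", "NC")]
d = label_components(edges)
print(d["ABC"], d["XYZ"], d["AAA"], d["CCC"], d["NY"], d["NC"])  # 0 0 1 1 2 2
```

Because the result is a plain dict keyed by airport, it can be passed to `df.flightTo.map(...)` in the same way as the map built from `nx.connected_components`.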
    

    Installing networkx

    • Pip: pip install networkx
    • Anaconda: conda install -c anaconda networkx