Pairwise matrix from a pandas dataframe

生来不讨喜 2020-12-19 21:01

I have a pandas dataframe that looks something like this:

             Al01   BBR60   CA07    NL219
AAEAMEVAT    MP      NaN     MP      MP 
AAFEDLRLL    NaN            


        
2 Answers
  • 2020-12-19 21:21

    The operation you are performing can be expressed as an application of np.einsum -- it's an inner product between each pair of columns:

    import numpy as np
    import pandas as pd
    
    df = pd.read_table('data', sep=r'\s+')
    print(df)
    #   Al01 BBR60 CA07 NL219
    # 0   MP   NaN   MP    MP
    # 1  NaN   NaN  NaN   NaN
    # 2   NP   NaN   NP    NP
    # 3  NaN    NP  NaN   NaN
    # 4  PB1   NaN  NaN   PB1
    # 5  NaN   NaN   NP    NP
    # 6   NP   NaN  NaN   NaN
    
    arr = (~df.isnull()).values.astype('int')
    print(arr)
    # [[1 0 1 1]
    #  [0 0 0 0]
    #  [1 0 1 1]
    #  [0 1 0 0]
    #  [1 0 0 1]
    #  [0 0 1 1]
    #  [1 0 0 0]]
    
    result = pd.DataFrame(np.einsum('ij,ik', arr, arr),
                          columns=df.columns, index=df.columns)
    print(result)
    

    yields

           Al01  BBR60  CA07  NL219
    Al01      4      0     2      3
    BBR60     0      1     0      0
    CA07      2      0     3      3
    NL219     3      0     3      4
    

    Usually when a calculation boils down to a numeric operation independent of indices, it is faster to do it with NumPy than with Pandas. That appears to be the case here:

    In [130]: %timeit df2 = df.applymap(lambda x: int(not pd.isnull(x)));  df2.T.dot(df2)
    1000 loops, best of 3: 1.12 ms per loop
    
    In [132]: %timeit arr = (~df.isnull()).values.astype('int'); pd.DataFrame(np.einsum('ij,ik', arr, arr), columns=df.columns, index=df.columns)
    10000 loops, best of 3: 132 µs per loop
    
  • 2020-12-19 21:31

    It's just matrix multiplication:

    import pandas as pd
    # sep=r'\s+' splits on any run of whitespace
    df = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
    # 1 where a value is present, 0 where it is NaN
    df2 = df.notnull().astype(int)
    print(df2.T.dot(df2))
    

    Output:

           Al01  BBR60  CA07  NL219
    Al01      4      0     2      3
    BBR60     0      1     0      0
    CA07      2      0     3      3
    NL219     3      0     3      4
    
    [4 rows x 4 columns]
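
Both answers compute the same quantity: entry (j, k) of the result is the sum over rows i of arr[i, j] * arr[i, k], i.e. the number of rows in which columns j and k are both non-null. The sketch below is one way to check that on the example data. It rebuilds the frame by hand from the output printed in the first answer (an assumption, since the original data file is not shown), and the names `present`, `via_matmul` and `via_einsum` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Frame rebuilt by hand from the printed output in the first answer
# (assumption: the original 'data' file is not available here).
nan = np.nan
df = pd.DataFrame({
    "Al01":  ["MP", nan, "NP", nan, "PB1", nan, "NP"],
    "BBR60": [nan, nan, nan, "NP", nan, nan, nan],
    "CA07":  ["MP", nan, "NP", nan, nan, "NP", nan],
    "NL219": ["MP", nan, "NP", nan, "PB1", "NP", nan],
})

# 1 where a value is present, 0 where it is NaN.
present = df.notna().astype(int)

# Matrix-product form (second answer): columns x columns co-occurrence counts.
via_matmul = present.T @ present

# einsum form (first answer), with the output indices written out explicitly.
via_einsum = np.einsum("ij,ik->jk", present.values, present.values)

# The two formulations should agree entry by entry.
assert (via_matmul.values == via_einsum).all()
print(via_matmul)
```

The diagonal holds the number of non-null entries in each column, and the off-diagonal entries hold the pairwise overlap counts, so the matrix is symmetric.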
    