Fastest way to calculate difference in all columns

我在风中等你 2020-12-10 09:38

I have a dataframe of all float columns. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))

4 answers
  •  余生分开走
    2020-12-10 10:12

    This post lists two NumPy approaches aimed at performance: one is fully vectorized, and the other uses a single loop.

    Approach #1

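    def numpy_triu1(df):
        a = df.values
        # Index pairs (r, c) with r < c: every unordered pair of columns
        r,c = np.triu_indices(a.shape[1],1)
        cols = df.columns
        # Name each output column "<left>_<right>" (assumes string column labels)
        nm = [cols[i]+"_"+cols[j] for i,j in zip(r,c)]
        return pd.DataFrame(a[:,r] - a[:,c], columns=nm)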
    

    Sample run -

    In [72]: df
    Out[72]: 
         A    B     C     D
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0
    
    In [78]: numpy_triu1(df)
    Out[78]: 
       A_B  A_C  A_D  B_C  B_D  C_D
    0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
    1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
    2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
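
    To see where the column order `A_B, A_C, ..., C_D` comes from, here is a minimal sketch of what `np.triu_indices` returns for four columns. The variable names `pairs` and `same_order` are illustrative only and are not part of the original answer.

```python
import itertools

import numpy as np

n = 4  # number of columns, as in the sample frame
r, c = np.triu_indices(n, 1)  # indices strictly above the main diagonal

# Each (r[k], c[k]) is one unordered column pair with r[k] < c[k].
pairs = list(zip(r.tolist(), c.tolist()))
print(pairs)

# The pairs come out in the same order as itertools.combinations(range(n), 2).
same_order = pairs == list(itertools.combinations(range(n), 2))
print(same_order)
```

    Indexing with `a[:,r]` and `a[:,c]` then gathers the left and right column of every pair in one step, which is why no Python-level loop is needed.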
    

    Approach #2

    If an array output, or a dataframe without the specialized column names, is acceptable, here is another approach that fills the output one block of columns at a time -

    def pairwise_col_diffs(a): # a would be df.values
        n = a.shape[1]
        N = n*(n-1)//2  # number of column pairs
        # Start/stop offsets of each column's block of differences in the output
        idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
        start, stop = idx[:-1], idx[1:]
        out = np.empty((a.shape[0],N),dtype=a.dtype)
        for i in range(n-1):
            # Column i minus every column to its right
            out[:, start[i]:stop[i]] = a[:,i,None] - a[:,i+1:]
        return out
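
    The offset bookkeeping is the least obvious part of the loop version, so here is a small sketch that isolates it and compares the result with the `np.triu_indices` indexing from Approach #1. The helper `block_offsets` is a hypothetical name introduced for illustration; it only repeats the `idx` computation.

```python
import numpy as np

def block_offsets(n):
    # Hypothetical helper: the same offsets that pairwise_col_diffs builds as `idx`.
    # Column i contributes n-1-i differences, so the blocks shrink by one each step.
    return np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))

def pairwise_col_diffs(a):
    n = a.shape[1]
    idx = block_offsets(n)
    out = np.empty((a.shape[0], n * (n - 1) // 2), dtype=a.dtype)
    for i in range(n - 1):
        # Column i minus every column to its right fills one contiguous block.
        out[:, idx[i]:idx[i + 1]] = a[:, i, None] - a[:, i + 1:]
    return out

a = np.arange(12.0).reshape(3, 4)  # same values as the sample frame
offsets = block_offsets(4)
print(offsets)

# Compare against the fully vectorized indexing of Approach #1.
r, c = np.triu_indices(4, 1)
matches = np.array_equal(pairwise_col_diffs(a), a[:, r] - a[:, c])
print(matches)
```

    For four columns the blocks have widths 3, 2 and 1, so the loop runs only `n-1` times while each iteration still does a vectorized subtraction.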
    

    Runtime test

    Since the OP mentioned that a multi-dimensional array output would also work for them, here are the array-based approaches from the other answers -

    import itertools

    # @Allen's soln
    def Allen(arr):
        n = arr.shape[1]
        idx = np.asarray(list(itertools.combinations(range(n),2))).T
        return arr[:,idx[0]]-arr[:,idx[1]]
    
    # @DYZ's soln
    # Note: computes arr[:,j] - arr[:,i] for i < j, the opposite sign of the other functions
    def DYZ(arr):
        result = np.concatenate([(arr.T - arr.T[x])[x+1:]
                for x in range(arr.shape[1])]).T
        return result
    

    The pandas-based solution from @Gerges Dib's post was not included, as it turned out to be much slower than the others.

    Timings -

    We will use three dataset sizes: 100, 500 and 1000 columns, with 3 rows each -

    In [118]: df = pd.DataFrame(np.random.randint(0,9,(3,100)))
         ...: a = df.values
         ...: 
    
    In [119]: %timeit DYZ(a)
         ...: %timeit Allen(a)
         ...: %timeit pairwise_col_diffs(a)
         ...: 
    1000 loops, best of 3: 258 µs per loop
    1000 loops, best of 3: 1.48 ms per loop
    1000 loops, best of 3: 284 µs per loop
    
    In [121]: df = pd.DataFrame(np.random.randint(0,9,(3,500)))
         ...: a = df.values
         ...: 
    
    In [122]: %timeit DYZ(a)
         ...: %timeit Allen(a)
         ...: %timeit pairwise_col_diffs(a)
         ...: 
    100 loops, best of 3: 2.56 ms per loop
    10 loops, best of 3: 39.9 ms per loop
    1000 loops, best of 3: 1.82 ms per loop
    
    In [123]: df = pd.DataFrame(np.random.randint(0,9,(3,1000)))
         ...: a = df.values
         ...: 
    
    In [124]: %timeit DYZ(a)
         ...: %timeit Allen(a)
         ...: %timeit pairwise_col_diffs(a)
         ...: 
    100 loops, best of 3: 8.61 ms per loop
    10 loops, best of 3: 167 ms per loop
    100 loops, best of 3: 5.09 ms per loop
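
    The timings above were taken with IPython's `%timeit`. For rerunning the comparison from a plain script, a rough sketch using the standard-library `timeit` module could look like the following. The helper `best_time`, the lowercase function names, and the `number`/`repeat` values are assumptions made for illustration, and the absolute numbers will differ by machine and library version.

```python
import itertools
import timeit

import numpy as np

def allen(arr):
    n = arr.shape[1]
    idx = np.asarray(list(itertools.combinations(range(n), 2))).T
    return arr[:, idx[0]] - arr[:, idx[1]]

def dyz(arr):
    # Returns arr[:, j] - arr[:, i] for i < j (sign opposite to the other two).
    return np.concatenate([(arr.T - arr.T[x])[x + 1:]
                           for x in range(arr.shape[1])]).T

def pairwise_col_diffs(a):
    n = a.shape[1]
    idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))
    out = np.empty((a.shape[0], n * (n - 1) // 2), dtype=a.dtype)
    for i in range(n - 1):
        out[:, idx[i]:idx[i + 1]] = a[:, i, None] - a[:, i + 1:]
    return out

def best_time(func, arr, number=20, repeat=3):
    # Hypothetical helper: best average seconds per call over `repeat` runs.
    runs = timeit.repeat(lambda: func(arr), number=number, repeat=repeat)
    return min(runs) / number

rng = np.random.default_rng(0)
a = rng.integers(0, 9, size=(3, 100))  # 3 rows, 100 columns

timings = {f.__name__: best_time(f, a) for f in (dyz, allen, pairwise_col_diffs)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

    Changing the second entry of `size` to 500 or 1000 mirrors the other two dataset sizes used in the test.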
    
