Aggregate all dataframe row pair combinations using pandas

你的背包 2020-12-19 04:16

I use python pandas to perform grouping and aggregation across data frames, but I would like to now perform specific pairwise aggregation of rows (n choose 2, statistical co

2 answers
  •  南笙 (original poster)
    2020-12-19 05:11

    Before going too far, keep in mind that the output gets big pretty fast. With 5 rows, the output will have C(5,2) = 4+3+2+1 = 10 rows, and in general n(n-1)/2 rows for n input rows.
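
To get a feel for how quickly the number of pairs grows, here is a minimal sketch using plain Python. The helper name `n_pairs` is made up for illustration and is not part of pandas or numpy.

```python
def n_pairs(n_rows: int) -> int:
    """Number of unordered row pairs, C(n_rows, 2) = n_rows * (n_rows - 1) / 2."""
    return n_rows * (n_rows - 1) // 2

# Print the pair count for a few input sizes
for n in (5, 100, 10_000):
    print(n, n_pairs(n))
```

Growth is quadratic, so a few thousand input rows already means millions of output rows.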

    That said, I'd think about doing this in numpy for speed (you may want to add a numpy tag to your question btw). Anyway, this isn't as vectorized as it might be, but ought to be a start at least:

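    import numpy as np
    import pandas as pd

    # Keep only the genes of interest, in the order given by mygenes
    df2 = df.set_index('Gene').loc[mygenes].reset_index()

    sz = len(df2)
    # Number of unordered pairs, C(sz, 2); integer division keeps it an int
    sz2 = sz * (sz - 1) // 2

    gene = df2['Gene'].tolist()
    # Numeric columns only (.iloc replaces .ix, which was removed from pandas)
    abc = df2.iloc[:, 1:].values

    arr = np.zeros((sz2, abc.shape[1]), dtype=abc.dtype)
    gene2 = []
    k = 0

    for i in range(sz):
        for j in range(i + 1, sz):   # visit each pair once, with i < j
            gene2.append(gene[i] + gene[j])
            arr[k] = abc[i] + abc[j]
            k += 1

    pd.concat([pd.DataFrame(gene2), pd.DataFrame(arr)], axis=1)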
    Out[1780]: 
              0  0  1  2  3
    0  ABC1ABC2  1  2  0  1
    1  ABC1ABC3  1  2  1  1
    2  ABC1ABC4  0  1  1  2
    3  ABC2ABC3  2  2  1  0
    4  ABC2ABC4  1  1  1  1
    5  ABC3ABC4  1  1  2  1
    

    Depending on size and speed requirements, you may need to separate the string handling from the numerical code and vectorize the numerical piece. This code is unlikely to scale well to big data; if yours is big, that may determine what sort of answer you need, and you may also need to think about how you store the results.
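
One way to vectorize the numerical piece is sketched below. It uses `np.triu_indices` to build all index pairs with i < j in one call, so the row sums need no Python loop, and only the string labels are joined in a list comprehension. The sample frame, including the column names `s1` to `s4` and the values, is invented for illustration and is not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a 'Gene' label column plus four numeric columns
df2 = pd.DataFrame({
    "Gene": ["ABC1", "ABC2", "ABC3", "ABC4"],
    "s1": [0, 1, 1, 0],
    "s2": [1, 1, 1, 0],
    "s3": [0, 0, 1, 1],
    "s4": [1, 0, 0, 1],
})

genes = df2["Gene"].tolist()
values = df2.drop(columns="Gene").to_numpy()

# Row indices (i, j) of every unordered pair with i < j, in row-major order
i, j = np.triu_indices(len(df2), k=1)

# Numerical piece: one vectorized addition over all pairs
pair_sums = values[i] + values[j]

# String piece: concatenate the two gene labels of each pair
pair_labels = [genes[a] + genes[b] for a, b in zip(i, j)]

result = pd.DataFrame(pair_sums, index=pair_labels,
                      columns=df2.columns.drop("Gene"))
print(result)
```

The index arrays themselves hold n(n-1)/2 entries each, so for very large inputs it may be necessary to process the pairs in chunks rather than all at once.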
