Question
Let's say I have a pandas DataFrame (loaded from a CSV file) with this structure (the number of var and err columns is not fixed and varies from file to file):
var_0; var_1; var_2;
32; 9; 41;
47; 22; 41;
15; 12; 32;
3; 4; 4;
10; 9; 41;
43; 21; 45;
32; 14; 32;
51; 20; 40;
Let's discard the err_ds_j and err_mean columns for the sake of this question (they are not shown in the sample above). I have to perform an automatic comparison of the values of each row with the values of the other rows; as an example: I have to compare the first row with the second, then with the third, then with the fourth, and so on; then I have to take the second row and compare it with the first, then with the third, and so on for the rest of the DataFrame.
Going deeper into the problem: for each pair of rows, I want to see whether all the "var_i" values of one of them are higher than or equal to the corresponding values of the other row. If this condition is satisfied, the row with the higher values is called DOMINANT, and I add a row to another DataFrame with this structure:
SET_A; SET_B; DOMINANT_SET
0; 1; B
...
Where the SET_A and SET_B values are indices from the CSV DataFrame, and DOMINANT_SET tells me which of the two is the dominant set (or "none" if neither dominates). I found the third column useful because it helps me avoid comparing rows I've already compared in the opposite order (e.g., comparing row 1 with row 0 is useless, since I've already compared 0 and 1 previously).
So, for that csv file, the output produced should be (and actually is, with my code):
SET_A SET_B DOMINANT_SET
1 0 1 B
2 0 2 none
3 0 3 A
4 0 4 A
5 0 5 B
6 0 6 none
7 0 7 none
8 1 2 A
9 1 3 A
10 1 4 A
11 1 5 none
12 1 6 A
13 1 7 none
14 2 3 A
15 2 4 none
16 2 5 B
17 2 6 B
18 2 7 B
19 3 4 B
20 3 5 B
21 3 6 B
22 3 7 B
23 4 5 B
24 4 6 none
25 4 7 none
26 5 6 A
27 5 7 none
28 6 7 B
I've already written all of the code for this particular problem, and it works just fine with some test datasets (100 rows sampled from an actual dataset).
Here's a snippet of the relevant code:
import numpy as np
import pandas as pd

def couple_already_tested(index1, index2, dataframe):
    return (((dataframe['SET_A'] == index1) & (dataframe['SET_B'] == index2)).any()) | (((dataframe['SET_A'] == index2) & (dataframe['SET_B'] == index1)).any())

def check_dominance(set_a, set_b, index_i, index_j, dataframe):
    length = dataframe.shape[0]
    if np.all(set_a >= set_b):
        print("FOUND DOMINANT CONFIGURATION A > B")
        dataframe.loc[length+1] = [index_i, index_j, 'A']
    elif np.all(set_b >= set_a):
        print("FOUND DOMINANT CONFIGURATION B > A")
        dataframe.loc[length+1] = [index_i, index_j, 'B']
    else:
        dataframe.loc[length+1] = [index_i, index_j, 'none']

df = pd.read_csv('test.csv', sep=';')
dom_table_df = pd.DataFrame(columns=['SET_A', 'SET_B', 'DOMINANT_SET'])
df_length = df.shape[0]
var_num = df.shape[1]-1
a = None
b = None
for i in range(0, df_length):
    a = df.iloc[i, 0:var_num].values
    for j in range(0, df_length):
        if j == i:
            continue
        b = df.iloc[j, 0:var_num].values
        if couple_already_tested(i, j, dom_table_df):
            print("WARNING: configuration", i, j, "already compared, skipping")
        else:
            print("Comparing configuration at row", i, "with configuration at row", j)
            check_dominance(a, b, i, j, dom_table_df)
print(dom_table_df)
The issue is that, since I'm not very proficient in either Python or pandas (I've been learning them for about a month and a half), this code is of course terribly slow for datasets with, say, 1,000 to 10,000 rows, because my algorithm relies on iteration. I know there's something called vectorization, but from reading about it I'm not entirely sure it's a good fit for my use case.
So, how could I speed up the calculations?
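(Editorial note, not part of the original question: the vectorization mentioned above does apply here. A minimal sketch using NumPy broadcasting replaces both loops with a single pairwise comparison; the small DataFrame below is a hypothetical stand-in for the first four rows of the test.csv sample.)

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first four rows of the question's test.csv.
df = pd.DataFrame({'var_0': [32, 47, 15, 3],
                   'var_1': [9, 22, 12, 4],
                   'var_2': [41, 41, 32, 4]})

vals = df.to_numpy()                                     # shape (n, var_num)
# ge[i, j] is True when every var of row i is >= the corresponding var of row j
ge = (vals[:, None, :] >= vals[None, :, :]).all(axis=2)

i_idx, j_idx = np.triu_indices(len(vals), k=1)           # all pairs (i, j) with i < j
# Same if/elif ordering as the question's check_dominance: A is tested first
dominant = np.where(ge[i_idx, j_idx], 'A',
                    np.where(ge[j_idx, i_idx], 'B', 'none'))
dom_table_df = pd.DataFrame({'SET_A': i_idx, 'SET_B': j_idx, 'DOMINANT_SET': dominant})
```

For these four rows this yields the same pairs and verdicts as the first six lines of the expected output above.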
Answer 1:
Another speedup can be accomplished by replacing .iloc[].values as well as .loc[] with .values[]; but in place of .loc[] we have to adjust the subscript, because .values takes a zero-based subscript, which is different from our 1-based dom_table_df.index.
dom_table_df = pd.DataFrame(index=np.arange(1, 1+(df_length**2-df_length)/2).astype('i'),
                            columns=['SET_A', 'SET_B', 'DOMINANT_SET'])
length = 0  # counter of already filled rows
for i in range(0, df_length):
    a = df.values[i, 0:var_num]
    for j in range(i+1, df_length):  # we can skip the range from 0 to i
        b = df.values[j, 0:var_num]
        #print("Comparing configuration at row", i, "with configuration at row", j)
        if np.all(a >= b):
            #print("FOUND DOMINANT CONFIGURATION A > B")
            dom_table_df.values[length] = [i, j, 'A']
        elif np.all(b >= a):
            #print("FOUND DOMINANT CONFIGURATION B > A")
            dom_table_df.values[length] = [i, j, 'B']
        else:
            dom_table_df.values[length] = [i, j, 'none']
        length += 1
Answer 2:
It's not a major change to the algorithm, but you can save more than half of the loop cycles, as well as the tests for j == i and couple_already_tested, if you choose the range for j adequately. The main loops become:
for i in range(0, df_length):
    a = df.iloc[i, 0:var_num].values
    for j in range(i+1, df_length):  # we can skip the range from 0 to i
        b = df.iloc[j, 0:var_num].values
        #print("Comparing configuration at row", i, "with configuration at row", j)
        check_dominance(a, b, i, j, dom_table_df)
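(A side note, not from the original answer: the same i < j pair ordering can be generated with the standard library's itertools.combinations, which avoids writing the nested ranges by hand.)

```python
from itertools import combinations

# For a hypothetical df_length of 4, enumerate each unordered pair exactly once.
pairs = list(combinations(range(4), 2))
print(pairs)  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```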
Answer 3:
Another (surprisingly) significant speedup can be accomplished by preallocating the output DataFrame rather than appending one row after the other. We can compute the resulting number of rows as (df_length**2 - df_length) / 2; for the 8-row sample above, that gives (64 - 8) / 2 = 28, matching the 28 rows of expected output. To determine the row number at which to insert the current output data set, we can now maintain a counter instead of using dataframe.shape[0]. This gives:
dom_table_df = pd.DataFrame(index=np.arange(1, 1+(df_length**2-df_length)/2).astype('i'),
                            columns=['SET_A', 'SET_B', 'DOMINANT_SET'])
length = 0  # counter of already filled rows
for i in range(0, df_length):
    a = df.iloc[i, 0:var_num].values
    for j in range(i+1, df_length):  # we can skip the range from 0 to i
        b = df.iloc[j, 0:var_num].values
        #print("Comparing configuration at row", i, "with configuration at row", j)
        length += 1
        if np.all(a >= b):
            #print("FOUND DOMINANT CONFIGURATION A > B")
            dom_table_df.loc[length] = [i, j, 'A']
        elif np.all(b >= a):
            #print("FOUND DOMINANT CONFIGURATION B > A")
            dom_table_df.loc[length] = [i, j, 'B']
        else:
            dom_table_df.loc[length] = [i, j, 'none']
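(A further variant, an editorial addition rather than part of the answer above: collecting the results in a plain Python list and constructing the DataFrame once at the end sidesteps row-wise .loc writes entirely, which is usually faster still. The three-row DataFrame below is a hypothetical stand-in for the question's test.csv.)

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the question's test.csv.
df = pd.DataFrame({'var_0': [32, 47, 15],
                   'var_1': [9, 22, 12],
                   'var_2': [41, 41, 32]})
vals = df.to_numpy()

rows = []  # accumulate plain tuples instead of writing into a DataFrame row by row
for i in range(len(vals)):
    a = vals[i]
    for j in range(i + 1, len(vals)):
        b = vals[j]
        if np.all(a >= b):
            rows.append((i, j, 'A'))
        elif np.all(b >= a):
            rows.append((i, j, 'B'))
        else:
            rows.append((i, j, 'none'))

# Build the output DataFrame in one shot at the end
dom_table_df = pd.DataFrame(rows, columns=['SET_A', 'SET_B', 'DOMINANT_SET'])
```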
Source: https://stackoverflow.com/questions/57572695/i-have-to-compare-data-from-each-row-of-a-pandas-dataframe-with-data-from-the-re