Speedy complex Pandas & dictionary manipulation

Question

I am new to Pandas and would like some advice on a tricky pivot table manipulation.

I have two Pandas pivot tables and a dictionary. The first pivot table has some values that are zero. The second pivot table has the same factors and levels but different values. For each factor, the dictionary maps every possible pair of levels to a number. Sample code:

import numpy as np
import pandas as pd

df = pd.DataFrame({'MyColumn1': ['A', 'A', 'B', 'B'],
                   'MyColumn2': ['M', 'N', 'M', 'P'],
                   'Value': [1, 1, 1, 1]})
table = pd.pivot_table(df, values='Value', index=['MyColumn1', 'MyColumn2'], aggfunc=np.sum, fill_value=0, dropna=False)

df2 = pd.DataFrame({'MyColumn1': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'MyColumn2': ['M', 'N', 'P', 'M', 'N', 'P'],
                   'Value': [5, 10, 15, 20, 25, 30]})
table2 = pd.pivot_table(df2, values='Value', index=['MyColumn1', 'MyColumn2'], aggfunc=np.sum)

myDictionary = {'MyColumn1': {('A', 'A'): 10, ('A', 'B'): 20,
                              ('B', 'A'): 30, ('B', 'B'): 40},
                'MyColumn2': {('M', 'M'): 1, ('M', 'N'): 2, ('M', 'P'): 3,
                              ('N', 'M'): 4, ('N', 'N'): 5, ('N', 'P'): 6,
                              ('P', 'M'): 7, ('P', 'N'): 8, ('P', 'P'): 9}}

This code generates the following tables:

TABLE                Value
MyColumn1 MyColumn2       
A         M              1
          N              1
          P              0
B         M              1
          N              0
          P              1

TABLE2               Value
MyColumn1 MyColumn2       
A         M              5
          N             10
          P             15
B         M             20
          N             25
          P             30

For every non-zero entry in the Table, I'd like to (1) go through every zero entry in the Table and, for each column, look up the number in myDictionary keyed by (zero entry's level, non-zero entry's level), then take the product of these numbers across columns, (2) multiply each product by the value in Table2 for the zero entry's row, and then (3) sum these results over all zero entries.

For instance, with the data above, the algorithm would calculate:

(A, P) -> (A, M) = 10 * 7 = 70, 70 * 15 = 1050
(A, P) -> (A, N) = 10 * 8 = 80, 80 * 15 = 1200
(A, P) -> (B, M) = 20 * 7 = 140, 140 * 15 = 2100
(A, P) -> (B, P) = 20 * 9 = 180, 180 * 15 = 2700

(B, N) -> (A, M) = 30 * 4 = 120, 120 * 25 = 3000
(B, N) -> (A, N) = 30 * 5 = 150, 150 * 25 = 3750
(B, N) -> (B, M) = 40 * 4 = 160, 160 * 25 = 4000
(B, N) -> (B, P) = 40 * 6 = 240, 240 * 25 = 6000

The first four rows correspond to the first zero entry in the Table, (A, P), and the last four rows to the second zero entry, (B, N). Each block has four rows because there are four non-zero entries. In each row, the first product combines the two dictionary values, and the second multiplies by the zero entry's value in Table2 (15 or 25). Finally, the results are added up for each non-zero entry:

(A, M) total = 1050 + 3000 = 4050
(A, N) total = 1200 + 3750 = 4950
(B, M) total = 2100 + 4000 = 6100
(B, P) total = 2700 + 6000 = 8700

The desired result is a pivot table of the same shape as the original table but with these sums as the values.

                     Value
MyColumn1 MyColumn2       
A         M           4050
          N           4950
          P              0
B         M           6100
          N              0
          P           8700
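
For reference, the three steps above can be sketched as a brute-force loop in plain Python. This is only a correctness baseline, not a fast solution: plain dicts stand in for the two pivot tables, and the names `brute_force_totals`, `my_dictionary` and `columns` are made up for this illustration.

```python
# Data copied from the example; plain dicts stand in for the pivot tables.
table = {('A', 'M'): 1, ('A', 'N'): 1, ('A', 'P'): 0,
         ('B', 'M'): 1, ('B', 'N'): 0, ('B', 'P'): 1}
table2 = {('A', 'M'): 5, ('A', 'N'): 10, ('A', 'P'): 15,
          ('B', 'M'): 20, ('B', 'N'): 25, ('B', 'P'): 30}
my_dictionary = {
    'MyColumn1': {('A', 'A'): 10, ('A', 'B'): 20,
                  ('B', 'A'): 30, ('B', 'B'): 40},
    'MyColumn2': {('M', 'M'): 1, ('M', 'N'): 2, ('M', 'P'): 3,
                  ('N', 'M'): 4, ('N', 'N'): 5, ('N', 'P'): 6,
                  ('P', 'M'): 7, ('P', 'N'): 8, ('P', 'P'): 9},
}
columns = ['MyColumn1', 'MyColumn2']


def brute_force_totals(table, table2, weights, columns):
    """Hypothetical helper: sum the weighted table2 values for each non-zero key."""
    zero_keys = [key for key, value in table.items() if value == 0]
    result = {key: 0 for key in table}  # zero entries keep a total of 0
    for nz_key, value in table.items():
        if value == 0:
            continue
        total = 0
        for z_key in zero_keys:
            # Step 1: product of the per-column weights, keyed (zero level, non-zero level)
            weight = 1
            for col, z_level, nz_level in zip(columns, z_key, nz_key):
                weight *= weights[col][(z_level, nz_level)]
            # Steps 2 and 3: scale by the zero entry's table2 value and accumulate
            total += weight * table2[z_key]
        result[nz_key] = total
    return result


totals = brute_force_totals(table, table2, my_dictionary, columns)
print(totals)
```

The cost grows as (zero entries) x (non-zero entries) x (columns), which is why a vectorized version is needed at the sizes mentioned below; a loop like this is mainly useful for checking a faster implementation on small inputs.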

I am looking for a way to do this efficiently because, in practice, I will have 100,000+ zero entries, <1000 non-zero entries, and ~10 columns. I'm not sure how to structure the loops for performance reasons, or even if loops are the most efficient way to do this. Any help would be much appreciated, thank you :)

Answer 1:

Not sure how fast this is on your actual data, but this is what I would do:

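# One weight matrix per factor: rows are the first element of each key pair
# (the zero entry's level), columns the second (the non-zero entry's level)
col1_df = pd.Series(myDictionary['MyColumn1']).unstack()
col2_df = pd.Series(myDictionary['MyColumn2']).unstack()

out_df = pd.DataFrame(index=table.index)

# loop through columns
for col in table.columns:
    zeros = table[col].eq(0)
    nz_index, z_index = table.index[~zeros], table.index[zeros]

    # DataFrame.lookup was removed in pandas 2.0, so translate the labels
    # to positions and index the underlying NumPy arrays instead
    r1 = col1_df.index.get_indexer(z_index.get_level_values(0))
    c1 = col1_df.columns.get_indexer(nz_index.get_level_values(0))
    r2 = col2_df.index.get_indexer(z_index.get_level_values(1))
    c2 = col2_df.columns.get_indexer(nz_index.get_level_values(1))

    # prods[i, j] = weight between zero entry i and non-zero entry j
    prods = (col1_df.to_numpy()[r1[:, None], c1]
             * col2_df.to_numpy()[r2[:, None], c2])

    # table2 values of the zero entries as a column vector, shape (num_z, 1)
    values = table2[col].reindex(z_index).to_numpy()[:, None]

    # sum over the zero entries, then put zeros back at the zero positions
    out_df[col] = (pd.Series((prods * values).sum(0), index=nz_index)
                     .reindex(table.index, fill_value=0)
                  )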

Output:

                     Value
MyColumn1 MyColumn2       
A         M           4050
          N           4950
          P              0
B         M           6100
          N              0
          P           8700
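
A possible alternative, separate from the code above: because each weight is a product of one number per column, the double sum factorizes into matrix products. The sketch below assumes exactly two columns and that both tables can be laid out as dense arrays over the full level grid. The names `D1`, `D2`, `T`, `T2` and `W` are illustrative only.

```python
import numpy as np

# Weight matrices from myDictionary: rows are the zero entry's level,
# columns are the non-zero entry's level.
D1 = np.array([[10, 20],      # MyColumn1, levels A, B
               [30, 40]])
D2 = np.array([[1, 2, 3],     # MyColumn2, levels M, N, P
               [4, 5, 6],
               [7, 8, 9]])

# The two pivot tables reshaped to (levels of MyColumn1, levels of MyColumn2).
T = np.array([[1, 1, 0],
              [1, 0, 1]])
T2 = np.array([[5, 10, 15],
               [20, 25, 30]])

zero_mask = (T == 0)
W = np.where(zero_mask, T2, 0)   # keep table2 values only at the zero positions

# totals[n1, n2] = sum over (z1, z2) of D1[z1, n1] * W[z1, z2] * D2[z2, n2]
totals = D1.T @ W @ D2

# Entries that were zero in the first table stay zero in the result.
result = np.where(zero_mask, 0, totals)
print(result)
```

With more columns the same idea would contract one axis at a time (for example with `np.tensordot`), but only if the full grid of level combinations fits in memory, which may not hold for around 10 columns.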

Source: https://stackoverflow.com/questions/62225670/speedy-complex-pandas-dictionary-manipulation

Tags: python, arrays, pandas, numpy, multidimensional-array