R: Vectorize loop to create pairwise matrix

问题

I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.

Here is an example df:

  df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
                   Fruit = c("apple", "orange", "pear",
                             "orange", "pear",
                             "pear", "apple",
                             "pear", "apple", "orange",
                             "pear", "apple", "orange"),
                   Order = c(1, 2, 3,
                            1, 2,
                            1, 2, 
                            1, 2, 3,
                            1, 1, 1))

In each Shop, Fruit is picked by a customer in a given Order.

The following function creates an m x n pairwise matrix:

loop.function <- function(df){
  
  fruits <- unique(df$Fruit)
  nt <- length(fruits)
  mat <- array(dim=c(nt,nt))
  
  for(m in 1:nt){
    
    for(n in 1:nt){
      
      ## filter df for each pair of fruit
      xm <- df[df$Fruit == fruits[m],]
      xn <- df[df$Fruit == fruits[n],]
      
      ## index instances when a pair of fruit are picked in same shop
      mm <- match(xm$Shop, xn$Shop)
      
      ## filter xm and xn based on mm
      xm <- xm[! is.na(mm),]
      xn <- xn[mm[! is.na(mm)],]
      
      ## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
      mat[m,n] <- sum(xn$Order < xm$Order)
    }
  }
  
  row.names(mat) <- fruits
  colnames(mat) <- fruits
  
  return(mat)
}

Where mat[m,n] is the number of times fruits[m] is picked after fruits[n]. And mat[n,m] is the number of times fruits[m] is picked before fruits[n]. It is not recorded if pairs of fruit are picked at the same time (e.g. in Shop E).

See expected output:

>loop.function(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

You can see here that pear is chosen twice before apple (in Shop C and D), and apple is chosen once before pear (in Shop A).

I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.

(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)

Update

See benchmarking with real data times = 10000 for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():

Unit: milliseconds
                    expr            min        lq       mean    median         uq      max     neval   cld
      loop.function(dat)     186.588600 202.78350 225.724249 215.56575 234.035750 999.8234    10000     e
     tidyverse.function(dat)  21.523400  22.93695  26.795815  23.67290  26.862700 295.7456    10000   c 
     loop.function2(dat)     119.695400 126.48825 142.568758 135.23555 148.876100 929.0066    10000    d
 datatable.function(dat)       8.517600   9.28085  10.644163   9.97835  10.766749 215.3245    10000  b 
  loop.function.TMS(dat)       4.482001   5.08030   5.916408   5.38215   5.833699  77.1935    10000 a

Probably the most interesting result for me is the performance of tidyverse.function() on the real data. I will have to try add Rccp solutions at a later date - I'm having trouble making them work on the real data.

I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!

回答1:

It seems not possible to vectorize over the original data frame df. But if you transform it using reshape2::dcast(), to have one line per each shop:

require(reshape2)

df$Fruit <- as.character(df$Fruit)

by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")

#   Shop apple orange pear
# 1    A     1      2    3
# 2    B    NA      1    2
# 3    C     2     NA    1
# 4    D     2      3    1
# 5    E     1      1    1

..., then you can easily vectorize at least for each combination of [m, n]:

fruits <- unique(df$Fruit)
outer(fruits, fruits, 
    Vectorize(
        function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), 
        c("m", "n")
    ), 
    by_shop)
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    2    0    1
# [3,]    1    2    0

This is probably the solution you desired to do with outer. Much faster solution would be a true vectorization over all combinations of fruits [m, n], but I've been thinking about it and I don't see any way to do it. So I had to use the Vectorize function which of course is much slower than true vectorization.

Benchmark comparison with your original function:

Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
     loop.function(df) 3.788794 3.926851 4.157606 4.002502 4.090898 9.529923   100
 loop.function.TMS(df) 1.582858 1.625566 1.804140 1.670095 1.756671 8.569813   100

Function & benchmark code (also added the preservation of the dimnames):

require(reshape2)   
loop.function.TMS <- function(df) { 
    df$Fruit <- as.character(df$Fruit)
    by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
    fruits <- unique(df$Fruit)
    o <- outer(fruits, fruits, Vectorize(function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE), c("m", "n")), by_shop)
    colnames(o) <- rownames(o) <- fruits
    o
}

require(microbenchmark)
microbenchmark(loop.function(df), loop.function.TMS(df))

回答2:

A data.table solution :

library(data.table)
setDT(df)
setkey(df,Shop)
dcast(df[df,on=.(Shop=Shop),allow.cartesian=T][
           ,.(cnt=sum(i.Order<Order&i.Fruit!=Fruit)),by=.(Fruit,i.Fruit)]
      ,Fruit~i.Fruit,value.var='cnt')

    Fruit apple orange pear
1:  apple     0      0    2
2: orange     2      0    1
3:   pear     1      2    0

The Shop index isn't necessary for this example, but will probably improve performance on a larger dataset.

As the question raised many comments on performance, I decided to check what Rcpp could bring:

library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {

std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
Rcpp::NumericVector Order = df["Order"];
Rcpp::StringVector Fruit = df["Fruit"];
StringVector FruitLevels = sort_unique(Fruit);
IntegerVector FruitInt = match(Fruit, FruitLevels);
int n  = FruitLevels.length();

std::string currentShop = "";
int order, fruit, i, f;

NumericMatrix result(n,n);
NumericVector fruitOrder(n);

for (i=0;i<Fruit.length();i++){
    if (currentShop != Shop[i]) {
       //Init counter for each shop
       currentShop = Shop[i];
       std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
    }
    order = Order[i];
    fruit = FruitInt[i];
    fruitOrder[fruit-1] = order;
    for (f=0;f<n;f++) {
       if (order > fruitOrder[f] & fruitOrder[f]>0 ) { 
         result(fruit-1,f) = result(fruit-1,f)+1; 
    }
  }
}
rownames(result) = FruitLevels;
colnames(result) = FruitLevels;
return(result);
}
')

rcppPair(df)

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

On the example dataset, this runs >500 times faster than the data.table solution, probably because it doesn't have the cartesian product problem. This isn't supposed to be robust on wrong input, and expects that shops / order are in ascending order.

Considering the few minutes spent to find the 3 lines of code for the data.table solution, compared to the much longer Rcpp solution / debugging process, I wouldn't recommend to go for Rcpp here unless there's a real performance bottleneck.

Interesting however to remember that if performance is a must, Rcpp might be worth the effort.

回答3:

Here is an approach that makes simple modifications to make it 5x faster.

loop.function2 <- function(df){

    spl_df = split(df[, c(1L, 3L)], df[[2L]])
    
    mat <- array(0L,
                 dim=c(length(spl_df), length(spl_df)),
                 dimnames = list(names(spl_df), names(spl_df)))
    
    for (m in 1:(length(spl_df) - 1L)) {
        xm = spl_df[[m]]
        mShop = xm$Shop
        for (n in ((1+m):length(spl_df))) {
            xn = spl_df[[n]]
            mm = match(mShop, xn$Shop)
            inds = which(!is.na(mm))
            mOrder = xm[inds, "Order"]
            nOrder = xn[mm[inds], "Order"]

            mat[m, n] <- sum(nOrder < mOrder)
            mat[n, m] <- sum(mOrder < nOrder)
        }
    }
    mat
}

There are 3 main concepts:

The original df[df$Fruits == fruits[m], ] lines were inefficient as you would be making the same comparison length(Fruits)^2 times. Instead, we can use split() which means we are only scanning the Fruits once.
There was a lot of use of df$var which will extract the vector during each loop. Here, we place the assignment of xm outside of the inner loop and we try to minimize what we need to subset / extract.
I changed it to be closer to combn as we can re-use our match() condition by doing both sum(xmOrder > xnOrder) and then switching it to sum(xmOrder < xnOrder).

Performance:

bench::mark(loop.function(df), loop.function2(df))

# A tibble: 2 x 13
##  expression              min median
##  <bch:expr>         <bch:tm> <bch:>
##1 loop.function(df)    3.57ms 4.34ms
##2 loop.function2(df)  677.2us 858.6us

My hunch is that for your larger dataset, @Waldi's data.table solution will be faster. But for smaller datasets, this should be pretty perfomant.

Finally, here's yet another rcpp approach that seems to be slower than @Waldi:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
    int x_size = x.size();
    IntegerMatrix ans(x_size, x_size);
    
    for (int m = 0; m < x_size - 1; m++) {
        DataFrame xm = x[m];
        CharacterVector mShop = xm[0];
        IntegerVector mOrder = xm[1];
        int nrows = mShop.size();
        for (int n = m + 1; n < x_size; n++) {
            DataFrame xn = x[n];
            CharacterVector nShop = xn[0];
            IntegerVector nOrder = xn[1];
            for (int i = 0; i < nrows; i++) {
                for (int j = 0; j < nrows; j++) {
                    if (mShop[i] == nShop[j]) {
                        if (mOrder[i] > nOrder[j])
                           ans(m, n)++;
                        else
                            ans(n, m)++;
                        break;
                    }
                }
            }
        }
    }
    return(ans);
}

loop_wrapper = function(df) {
  loop_function_cpp(split(df[, c(1L, 3L)], df[[2L]]))
}
loop_wrapper(df)
``

回答4:

OK, here is a solution:

library(tidyverse)

# a dataframe with all fruit combinations
df_compare <-  expand.grid(row_fruit = unique(df$Fruit)
                           , column_fruit = unique(df$Fruit)
                           , stringsAsFactors = FALSE)

df_compare %>%
    left_join(df, by = c("row_fruit" = "Fruit")) %>%
    left_join(df, by = c("column_fruit" = "Fruit")) %>%
    filter(Shop.x == Shop.y &
               Order.x < Order.y) %>%
    group_by(row_fruit, column_fruit) %>%
    summarise(obs = n()) %>%
    pivot_wider(names_from = row_fruit, values_from = obs) %>%
    arrange(column_fruit) %>%
    mutate_if(is.numeric, function(x) replace_na(x, 0)) %>%
    column_to_rownames("column_fruit") %>%
    as.matrix()

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0

If you don't know what is going on in the second code part (df_compare %>% ...), read the "pipe" (%>%) as 'then'. Run the code from df_compare to just before any of the pipes to see the intermediate results.

来源：https://stackoverflow.com/questions/62794776/r-vectorize-loop-to-create-pairwise-matrix

标签

performance

loops

matrix

vectorization