Is the “*apply” family really not vectorized?

萌比男神i 2020-11-22 05:25

So we are used to telling every new R user that "apply isn't vectorized, check out Patrick Burns' R Inferno, Circle 4", which says (I quote):

4 Answers
  •  感动是毒
    2020-11-22 06:07

    So, to sum up the great answers/comments into one general answer and provide some background: R has 4 types of loops (ordered from not vectorized to vectorized):

    1. An R for loop that repeatedly calls R functions in each iteration (not vectorized)
    2. A C loop that repeatedly calls R functions in each iteration (not vectorized)
    3. A C loop that calls an R function only once (somewhat vectorized)
    4. A plain C loop that doesn't call any R function at all and uses its own compiled functions (vectorized)

    So the *apply family is the second type, except apply, which is more of the first type.
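To make the distinction concrete, here is a minimal sketch of types 1, 2 and 4 computing the same result. The names `x` and `sq` are made up for illustration, and type 3 is left out because the answer gives no concrete example of it.

```r
# Illustrative only: `x` and `sq` are hypothetical names, not from the answer.
x  <- as.numeric(1:10)
sq <- function(v) v^2

# Type 1: an interpreted R for loop calling an R function on every iteration.
out_for <- numeric(length(x))
for (i in seq_along(x)) out_for[i] <- sq(x[i])

# Type 2: the loop itself runs in C inside vapply(), but sq() is still
# evaluated once per element.
out_apply <- vapply(x, sq, numeric(1))

# Type 4: `^` loops over the whole vector in compiled code; no R function
# is called per element.
out_vec <- x^2

# All three approaches give the same values.
stopifnot(identical(out_for, out_apply), identical(out_for, out_vec))
```

The results are the same in all three cases; what differs is where the loop runs and how often the R evaluator is invoked.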

    You can see this from the comment in lapply's C source code:

    /* .Internal(lapply(X, FUN)) */

    /* This is a special .Internal, so has unevaluated arguments. It is
    called from a closure wrapper, so X and FUN are promises. FUN must be unevaluated for use in e.g. bquote . */

    That means that lapply's C code accepts an unevaluated function from R and later evaluates it within the C code itself. This is basically the difference between lapply's .Internal call

    .Internal(lapply(X, FUN))
    

    which has a FUN argument that holds an R function,

    and the colMeans .Internal call, which does not have a FUN argument:

    .Internal(colMeans(Re(x), n, prod(dn), na.rm))
    

    colMeans, unlike lapply, knows exactly which function it needs to use, so it calculates the mean internally within the C code.
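The difference in the two interfaces can be checked from R itself. The sketch below inspects the formal arguments and the deparsed bodies of the two closures. It assumes a reasonably recent base R in which both wrappers still contain the `.Internal` calls quoted above; the variable names are made up for illustration.

```r
# lapply takes a function argument; colMeans does not.
lapply_has_fun   <- "FUN" %in% names(formals(lapply))
colMeans_has_fun <- "FUN" %in% names(formals(colMeans))

# Deparse the R-level wrappers and look for the .Internal calls.
lapply_src   <- paste(deparse(body(lapply)), collapse = "\n")
colMeans_src <- paste(deparse(body(colMeans)), collapse = "\n")

lapply_internal   <- grepl(".Internal(lapply(X, FUN))", lapply_src, fixed = TRUE)
colMeans_internal <- grepl(".Internal(colMeans(", colMeans_src, fixed = TRUE)

print(c(lapply_has_fun = lapply_has_fun, colMeans_has_fun = colMeans_has_fun))
```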

    You can clearly see the R function being evaluated in each iteration within lapply's C code:

     for(R_xlen_t i = 0; i < n; i++) {
          if (realIndx) REAL(ind)[0] = (double)(i + 1);
          else INTEGER(ind)[0] = (int)(i + 1);
          tmp = eval(R_fcall, rho);   // <----------------------------- here it is
          if (MAYBE_REFERENCED(tmp)) tmp = lazy_duplicate(tmp);
          SET_VECTOR_ELT(ans, i, tmp);
       }
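The effect of that per-iteration `eval` can be observed from R by counting calls. This is a small sketch with a hypothetical wrapper, `counted_sqrt`, that increments a counter each time it is entered; it is not part of the original answer.

```r
# Hypothetical helper: counts how many times it is called.
calls <- 0L
counted_sqrt <- function(v) {
  calls <<- calls + 1L   # update the counter defined alongside the function
  sqrt(v)
}

x <- c(1, 4, 9, 16)

# lapply evaluates the R function once per element.
res_lapply   <- lapply(x, counted_sqrt)
calls_lapply <- calls

# A single vectorized call enters the R function once; sqrt() then loops in C.
calls     <- 0L
res_vec   <- counted_sqrt(x)
calls_vec <- calls

print(c(lapply = calls_lapply, vectorized = calls_vec))
```

With four elements, lapply enters the R function four times, while the direct call enters it once.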
    

    To sum things up, lapply is not vectorized, though it has two possible advantages over the plain R for loop:

    1. Accessing and assigning in a loop seems to be faster in C (i.e. when lapplying a function). Although the difference seems big, we still stay at the microsecond level, and the costly part is the evaluation of an R function in each iteration. A simple example:

      ffR = function(x)  {
          ans = vector("list", length(x))
          for(i in seq_along(x)) ans[[i]] = x[[i]]
          ans 
      }
      
      ffC = inline::cfunction(sig = c(R_x = "data.frame"), body = '
          SEXP ans;
          PROTECT(ans = allocVector(VECSXP, LENGTH(R_x)));
          for(int i = 0; i < LENGTH(R_x); i++) 
                 SET_VECTOR_ELT(ans, i, VECTOR_ELT(R_x, i));
          UNPROTECT(1);
          return(ans); 
      ')
      
      set.seed(007) 
      myls = replicate(1e3, runif(1e3), simplify = FALSE)     
      mydf = as.data.frame(myls)
      
      all.equal(ffR(myls), ffC(myls))
      #[1] TRUE 
      all.equal(ffR(mydf), ffC(mydf))
      #[1] TRUE
      
      microbenchmark::microbenchmark(ffR(myls), ffC(myls), 
                                     ffR(mydf), ffC(mydf),
                                     times = 30)
      #Unit: microseconds
      #      expr       min        lq    median        uq       max neval
      # ffR(myls)  3933.764  3975.076  4073.540  5121.045 32956.580    30
      # ffC(myls)    12.553    12.934    16.695    18.210    19.481    30
      # ffR(mydf) 14799.340 15095.677 15661.889 16129.689 18439.908    30
      # ffC(mydf)    12.599    13.068    15.835    18.402    20.509    30
      
    2. As mentioned by @Roland, it runs a compiled C loop rather than an interpreted R loop.


    Though when vectorizing your code, there are some things you need to take into account.

    1. If your data set (let's call it df) is of class data.frame, some vectorized functions (such as colMeans, colSums, rowSums, etc.) will have to convert it to a matrix first, simply because this is how they were designed. For a big df this can create a huge overhead. lapply doesn't have to do this, because it extracts the actual vectors out of df (a data.frame is just a list of vectors). Thus, if you have few columns but many rows, lapply(df, mean) can sometimes be a better option than colMeans(df).
    2. Another thing to remember is that R has a great variety of function types, such as .Primitive and generic (S3, S4) functions. Generic functions have to do method dispatch, which is sometimes a costly operation. For example, mean is an S3 generic function while sum is a Primitive. Thus lapply(df, sum) can sometimes be very efficient compared to colSums, for the reasons listed above.
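A minimal sketch of both points, using a made-up data frame `df` (the column names and sizes are arbitrary). It only checks that the two routes agree and that sum and mean are different kinds of function. It makes no timing claim, since which route is faster depends on the shape of the data and should be measured, e.g. with microbenchmark as in the example above.

```r
set.seed(1)
# Hypothetical data: few columns, many rows.
df <- data.frame(a = runif(1000), b = runif(1000), c = runif(1000))

# Route 1: colMeans/colSums coerce the data.frame to a matrix first.
means_matrix <- colMeans(df)
sums_matrix  <- colSums(df)

# Route 2: iterate over the columns directly; a data.frame is a list of vectors.
means_list <- vapply(df, mean, numeric(1))
sums_list  <- vapply(df, sum, numeric(1))

# Both routes agree up to floating-point tolerance.
means_agree <- isTRUE(all.equal(means_matrix, means_list))
sums_agree  <- isTRUE(all.equal(sums_matrix, sums_list))

# sum is a primitive; mean is an ordinary closure that dispatches via S3.
sum_is_primitive  <- is.primitive(sum)
mean_is_primitive <- is.primitive(mean)
```

vapply is used here instead of lapply only so that the result is a named numeric vector that can be compared directly with the output of colMeans and colSums.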
