Why not use a for loop?

野趣味 2021-01-18 01:13

I've been seeing a lot of comments among data scientists online about how for loops are not advisable. However, I recently found myself in a situation where using one was h…

2 Answers
  •  [愿得一人]
    2021-01-18 01:40

    For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.

    Here's an example below, where I used microbenchmark::microbenchmark to compare your solution as presented in the OP, Moody's solution as given in his post, and a third solution of mine with even more vectorization (a triple-nested lapply).

    Microbenchmark

    set.seed(1976); code = seq_len(60); time = rep(c(0,1,2), each = 20);
    DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2)); DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2)); DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2)); DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
    dat = data.frame(code, time, DV1, DV2, DV3, DV4)
    
    library(microbenchmark)
    
    microbenchmark(
        `Peter Miksza` = {
            outANOVA1 = list()
            for (i in names(dat)) {
                y = dat[[i]]
            outANOVA1[[i]] = summary(aov(y ~ factor(time) + Error(factor(code)), 
                data = dat))
        }},
        Moody_Mudskipper = {
            outANOVA2 =
                lapply(dat,function(y)
                    summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
        },
        `catastrophic_failure` = {
            outANOVA3 = 
                lapply(lapply(lapply(dat, function(y) y ~ factor(time) + Error(factor(code))), aov, data = dat), summary)
        },
        times = 1000L)
    

    Results

    #Unit: milliseconds
    #                 expr      min       lq     mean   median       uq       max neval cld
    #         Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448  1000   b
    #     Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811  1000  a 
    # catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666  68.87781  1000  a 
    

    Disabling JIT compilation via compiler::setCompilerOptions(optimize = 0) and compiler::enableJIT(0) yields the following results as well:

    #Unit: milliseconds
    #                 expr      min       lq     mean   median       uq      max neval cld
    #         Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731  1000   a
    #     Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284  1000   a
    # catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036  1000   a
    

    Conclusion

    As alluded to in Dirk's comment, there is no meaningful difference in performance, but readability is greatly impaired by the vectorized versions.

    On growing lists

    Experimenting with Moody's solution, it seems growing a list element by element can be a bad idea if the resulting list is moderately long. Also, using byte-compiled functions directly can provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application, though.
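    To make the pre-allocation point concrete, here is a minimal sketch of my own (the function names `grow` and `prealloc` are made up for illustration, not part of the benchmark above). Both build the same list; the second fixes the list's length up front with `vector("list", n)` instead of extending it one element at a time:

    ```r
    # Growing a list one element at a time: R may have to
    # re-extend the list repeatedly as it grows.
    grow <- function(n) {
      out <- list()
      for (i in seq_len(n)) out[[i]] <- i * 2
      out
    }

    # Pre-allocating: the list has its final length from the start,
    # and the loop only fills in the slots.
    prealloc <- function(n) {
      out <- vector("list", n)
      for (i in seq_len(n)) out[[i]] <- i * 2
      out
    }

    # Both approaches produce identical results; only the
    # allocation pattern differs.
    identical(grow(1000L), prealloc(1000L))
    ```

    You can compare the two with microbenchmark::microbenchmark as above; the gap widens as the list gets longer, though recent versions of R mitigate the cost of growing somewhat.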
