How to efficiently use Rprof in R?

深忆病人 2020-11-28 01:50

I would like to know if it is possible to get a profile of R code in a way that is similar to MATLAB's Profiler. That is, to get to know which l

4 answers
  •  挽巷 (original poster)
    2020-11-28 02:08

    Alert readers of yesterday's breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:

    • Profiling via Rprof() now optionally records information at the statement level, not just the function level.

    And indeed, this new feature answers my question and I will show how.


    Let's say we want to check whether vectorizing and pre-allocating are really better than good old for-loops and incremental building of data when calculating a summary statistic such as the mean. The (relatively stupid) code is the following:

    # create big data frame:
    n <- 1000
    x <- data.frame(group = sample(letters[1:4], n, replace = TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n), stringsAsFactors = TRUE)
    
    # reasonable operations:
    marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)
    
    # unreasonable operations:
    marginal.means.2 <- marginal.means.1[NULL,]
    
    row.counter <- 1
    for (condition in levels(x$condition)) {
      for (group in levels(x$group)) {  
        tmp.value <- 0
        tmp.length <- 0
        for (c in seq_len(nrow(x))) {
          if ((x[c,"group"] == group) && (x[c,"condition"] == condition)) {
            tmp.value <- tmp.value + x[c,"data"]
            tmp.length <- tmp.length + 1
          }
        }
        marginal.means.2[row.counter,"group"] <- group 
        marginal.means.2[row.counter,"condition"] <- condition
        marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
        row.counter <- row.counter + 1
      }
    }
    
    # does it produce the same results?
    all.equal(marginal.means.1, marginal.means.2)
    

    For Rprof to report line numbers, the code has to be parsed with its source references kept. That is, it needs to be saved in a file and evaluated from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.
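
    Line-level profiling relies on the source references that the parser attaches when keep.source = TRUE. As a minimal sketch (the file name demo_script.R and its three lines are made up for illustration), the following shows that each parsed top-level expression remembers the line it came from:

```r
# Hypothetical three-line script written to a temporary file.
script <- file.path(tempdir(), "demo_script.R")
writeLines(c("x <- 1:10", "y <- cumsum(x)", "z <- mean(y)"), script)

# keep.source = TRUE asks the parser to keep source references (srcrefs).
exprs <- parse(file = script, keep.source = TRUE)
refs <- attr(exprs, "srcref")  # one srcref per top-level expression

# The first integer stored in a srcref is the line where the expression starts.
first_lines <- vapply(refs, function(r) as.integer(r)[1], integer(1))
print(first_lines)
```

    Without keep.source = TRUE the "srcref" attribute is not attached, which is presumably why the profiler then has no line to report.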

    Now, we

    • simply create a profile file and indicate that we want to save the line number,
    • source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (seemingly the infamous fortune(106) does not apply here, as I haven't found another way)
    • stop the profiling and indicate that we want the output based on the line numbers.

    The code is:

    Rprof("profile1.out", line.profiling=TRUE)
    eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE))
    Rprof(NULL)
    
    summaryRprof("profile1.out", lines = "show")
    

    Which gives:

    $by.self
                               self.time self.pct total.time total.pct
    download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
    <no location>                   4.38    34.93       4.38     34.93
    download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
    download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
    download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16
    download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16
    
    $by.total
                               total.time total.pct self.time self.pct
    download.php?i=KjdkSVZq#17       8.04     64.11      8.04    64.11
    <no location>                    4.38     34.93      4.38    34.93
    download.php?i=KjdkSVZq#16       0.06      0.48      0.06     0.48
    download.php?i=KjdkSVZq#18       0.02      0.16      0.02     0.16
    download.php?i=KjdkSVZq#23       0.02      0.16      0.02     0.16
    download.php?i=KjdkSVZq#6        0.02      0.16      0.02     0.16
    
    $by.line
                               self.time self.pct total.time total.pct
    <no location>                   4.38    34.93       4.38     34.93
    download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16
    download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
    download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
    download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
    download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16
    
    $sample.interval
    [1] 0.02
    
    $sampling.time
    [1] 12.54
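
    As a quick sanity check on how to read these tables: the percentages appear to be each row's time divided by $sampling.time, and each time is a whole number of 0.02 s samples. A small sketch, with the numbers copied by hand from the output above (the vector names are only labels chosen here, and "other" stands for the row that carries no file#line label):

```r
# Self times copied from the $by.self table; names are illustrative labels.
self_time <- c(line17 = 8.04, other = 4.38, line16 = 0.06,
               line18 = 0.02, line23 = 0.02, line6 = 0.02)
sampling_time <- 12.54    # $sampling.time
sample_interval <- 0.02   # $sample.interval

# Share of the total sampling time spent in each row, in percent.
self_pct <- round(100 * self_time / sampling_time, 2)

# Number of profiler samples behind each row.
samples <- round(self_time / sample_interval)

print(self_pct)
print(samples)
```

    The self times add up to the sampling time, so line 17 alone accounts for roughly two thirds of all samples.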
    

    Checking the source code tells us that the problematic line (#17) is indeed the stupid if-statement in the innermost for-loop. By contrast, calculating the same result with vectorized code (line #6) takes basically no time.
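
    The same workflow can be tried without the pastebin link. The sketch below is a stand-in, not the original benchmark: the script name slow_loop.R, its contents, and the profile file name are all invented here. It profiles a small local file with the same Rprof / eval(parse()) pattern and then looks up the source line that tops $by.self:

```r
# Hypothetical local script with a deliberately slow row-by-row loop.
script <- file.path(tempdir(), "slow_loop.R")
writeLines(c(
  "n <- 50000",
  "x <- data.frame(id = seq_len(n), data = rnorm(n))",
  "total <- 0",
  "for (i in seq_len(n)) {",
  "  total <- total + x[i, \"data\"]",
  "}"
), script)

# Profile with line information, exactly as in the answer but on a local file.
prof_file <- file.path(tempdir(), "profile_local.out")
Rprof(prof_file, line.profiling = TRUE)
eval(parse(file = script, keep.source = TRUE))
Rprof(NULL)

res <- summaryRprof(prof_file, lines = "show")

# The first row of $by.self is the most expensive entry; if it is of the
# form "file#line", print the corresponding line of the script.
top <- rownames(res$by.self)[1]
if (grepl("#", top, fixed = TRUE)) {
  line_no <- as.integer(sub(".*#", "", top))
  cat(top, "->", readLines(script)[line_no], "\n")
}
```

    Because Rprof samples at fixed intervals, the exact times (and, for very short runs, even the top row) can differ between runs and machines.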

    I haven't tried it with any graphical output, but I am already very impressed by what I got so far.
