lapply vs for loop - Performance R

こ雲淡風輕ζ 提交于 2019-11-26 17:45:37
Joris Meys

First of all, it is an already long debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.

That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason to not consider lapply.

lapply is a function you should use for its side effects (or lack of side effects). The function lapply combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:

Is R's apply family more than syntactic sugar?

The reason your lapply solution is far slower, is because your way of using it creates a lot more overhead.

  • replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will be actually faster than using replicate.
  • inside your lapply anonymous function, you have to access the dataframe for both x and y for every observation. This means that -contrary to in your for-loop- eg the function $ has to be called every time.
  • Because you use these high-end functions, your 'lapply' solution calls 49 functions, compared to your for solution that only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ... All functions not needed by your for loop as that one doesn't do any of these checks.

If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.

You can use the following code to get a better idea of where you lose your performance with the lapply. Run this line by line!

Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self

Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 
Rprof(NULL)
perprof <- summaryRprof()$by.self

fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)

Selftime <- merge(fprof, perprof,
                  all = TRUE,
                  by = 'Fun',
                  suffixes = c(".lapply",".for"))

sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
         c("Fun","self.time.lapply","self.time.for")]

Selftime[is.na(Selftime$self.time.for),]

Actually,

I did test the difference with a a problem that a solve recently.

Just try yourself.

In my conclusion, have no difference but for loop to my case were insignificantly more faster than lapply.

Ps: I try mostly keep the same logic in use.

ds <- data.frame(matrix(rnorm(1000000), ncol = 8))  
n <- c('a','b','c','d','e','f','g','h')  
func <- function(ds, target_col, query_col, value){
  return (unique(as.vector(ds[ds[query_col] == value, target_col])))  
}  

f1 <- function(x, y){
  named_list <- list()
  for (i in y){
    named_list[[i]] <- func(x, 'a', 'b', i)
  }
  return (named_list)
}

f2 <- function(x, y){
  list2 <- lapply(setNames(nm = y), func, ds = x, target_col = "a", query_col = "b")
  return(list2)
}

benchmark(f1(ds2, n ))
benchmark(f2(ds2, n ))

As you could see, I did a simple routine to build a named_list based in a dataframe, the func function does the column values extracted, the f1 uses a for loop to iterate through the dataframe and the f2 uses a lapply function.

In my computer I get this results:

test replications elapsed relative user.self sys.self user.child
1 f1(ds2, n)          100  110.24        1   110.112        0          0
  sys.child
1         0

&&

        test replications elapsed relative user.self sys.self user.child
1 f1(ds2, n)          100  110.24        1   110.112        0          0
  sys.child
1         0
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!