Speeding up a function

Posted by 五迷三道 on 2019-12-24 21:25:35

Question


I want to calculate first differences for a large panel data set. At the moment this takes more than an hour, and I am curious whether there are any options left to speed it up. An example data set:

library(data.table)
set.seed(1)
DF <- data.table(panelID = sample(50,50),                                   # Creates a panel ID
                 Country = c(rep("A",30),rep("B",50), rep("C",20)),
                 Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                 Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
                 norm = round(runif(100)/10,2),
                 Income = sample(100,100),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = round(rnorm(10,0.75,0.3),2),
                 Educ = round(rnorm(10,0.75,0.3),2))
DF[, uniqueID := .I]                                                        # Adds a unique row ID

Here is what I have tried:

DFx <- DF
start_time <- Sys.time()
    DF <- DF[, lapply(.SD, function(x) x - shift(x)), by = panelID, .SDcols = (sapply(DF, is.numeric))]
end_time <- Sys.time()
DF <- DFx
start_time2 <- Sys.time()
    cols = sapply(DF, is.numeric)
    DF <- DF[, lapply(.SD, function(x) x - shift(x)), by = panelID, .SDcols = cols]
end_time2 <- Sys.time()
DF <- DFx
start_time3 <- Sys.time()
DF <- DF[order(panelID)] # Sort on panelID
nm1 <- sapply(DF, is.numeric) # Get the numerical columns
nm1 = names(nm1)
nm2 <- paste("delta", nm1, sep="_")[-6] # Build the names of the new columns
DF <- DF[,(nm2) := .SD - shift(.SD), by=panelID] # Add the first-difference columns
end_time3 <- Sys.time()
end_time3 - start_time3
end_time2 - start_time2
end_time - start_time

For some reason the third option works on my actual data but not on this example, where it fails with: Error in FUN(left, right) : non-numeric argument to binary operator. On my actual data this approach was also rather slow (and I would still have to subset afterwards).

Any ideas how to make this faster?
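
A likely cause of the error in the third option: names(nm1) returns the names of all columns rather than only the numeric ones, and without .SDcols the .SD object still contains non-numeric columns such as Country. A minimal sketch of a corrected version on a small toy table (the table DT and the names num_cols and delta_cols are made up for illustration, not taken from the original data):

```r
library(data.table)

# Toy panel (hypothetical data, only for illustration)
DT <- data.table(
  panelID = c(1L, 1L, 1L, 2L, 2L),
  Country = c("A", "A", "A", "B", "B"),
  Income  = c(10, 15, 13, 7, 9),
  Age     = c(30, 31, 32, 40, 41)
)

# Keep only the names of the numeric columns, and drop the grouping key
num_cols   <- names(DT)[sapply(DT, is.numeric)]
num_cols   <- setdiff(num_cols, "panelID")
delta_cols <- paste("delta", num_cols, sep = "_")

# First difference per panel, restricted to the numeric columns via .SDcols
DT[, (delta_cols) := lapply(.SD, function(x) x - shift(x)),
   by = panelID, .SDcols = num_cols]
```

Restricting .SD with .SDcols keeps the character column out of the subtraction, and the number of new column names then matches the number of differenced columns.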


Answer 1:


data.table is optimized for many rows, not for many columns. Since you have many columns, you could try melting the data.table:

DFm <- melt(DF[, cols, with = FALSE][, !"uniqueID"], id = "panelID") 
# coerces all numbers to double (the common type);
# you could split the data.table into integer and double columns to avoid this

DFm[, value := c(NA, diff(value)), by = .(panelID, variable)]


dcast(DFm, panelID + rowidv(DFm, cols = c("panelID", "variable")) ~ variable, value.var = "value")
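
The melt / diff / dcast steps above can be tried end to end on a small table. This is only a sketch under assumptions: the table DT, the helper column obs, and the names long and wide are made up for illustration, and rowid() is used in place of rowidv() to build the within-group row index before casting.

```r
library(data.table)

# Toy panel (hypothetical data, only for illustration)
DT <- data.table(
  panelID = c(1L, 1L, 1L, 2L, 2L),
  Income  = c(10, 15, 13, 7, 9),
  Age     = c(30, 31, 32, 40, 41)
)

# Wide -> long: one row per panelID, variable and observation
long <- melt(DT, id.vars = "panelID")

# First difference within each panel and variable
long[, value := c(NA, diff(value)), by = .(panelID, variable)]

# Row index within each (panelID, variable) group, needed to cast back
long[, obs := rowid(panelID, variable)]

# Long -> wide: one column of differences per original variable
wide <- dcast(long, panelID + obs ~ variable, value.var = "value")
```

Whether this is faster than the column-wise lapply depends on the shape of the data, so it is worth timing both on a realistic subset.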


Source: https://stackoverflow.com/questions/57406654/speeding-up-a-function
