Question
I have data like this:
n <- 1e5
set.seed(24)
df1 <- data.frame(query_string = sample(sprintf("%06d", 100:1000),
n, replace=TRUE), id.x = sample(1:n),
s_val = sample(paste0("F", 400:700), n,
replace=TRUE), id.y = sample(100:3000, n, replace=TRUE),
ID_col_n = sample(100:1e6, n, replace=TRUE), total_id = 1:n)
I use the spread function from tidyr to spread the query_string values into one column per s_val, like this:
library(tidyr)
res <- spread(resNik,s_val,value=query_string,fill=NA)
This works perfectly, but when the data is huge it seems like it will never finish. I can't tell whether my computer has hung or is still running; after two hours there is still no result.
Can anyone suggest another function, or some other approach, that is faster than spread?
Answer 1:
Based on benchmarks on 1e5 rows, dcast from data.table is faster:
library(data.table)
system.time({res1 <- spread(df1,s_val,value=query_string,fill=NA)})
# user system elapsed
# 1.50 0.25 1.75
system.time({res2 <- dcast(setDT(df1), id.x+id.y + ID_col_n +total_id~s_val,
value.var = "query_string")})
# user system elapsed
# 0.61 0.03 0.61
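# Sort both results by id.x so they can be compared (%>% and arrange come from dplyr)
library(dplyr)
res11 <- res1 %>%
arrange(id.x)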
res21 <- res2[order(id.x)]
all.equal(as.data.frame(res11), as.data.frame(res21), check.attributes=FALSE)
#[1] TRUE
The gap widens as the number of rows grows: dcast is roughly 2.9 times faster in elapsed time at 1e5 rows and roughly 5.1 times faster after changing 'n' to 1e6:
system.time({res1 <- spread(df1,s_val,value=query_string,fill=NA)})
# user system elapsed
# 28.64 3.17 31.91
system.time({res2 <- dcast(setDT(df1), id.x+id.y + ID_col_n +total_id~s_val,
value.var = "query_string")})
# user system elapsed
# 5.22 1.08 6.21
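To make the shape of the result easier to see, here is a minimal sketch of the same kind of dcast call on a tiny table. The table name toy and its three rows are made up for illustration; only the column names follow df1, and the extra id columns are left out to keep it short.

```r
library(data.table)

# Hypothetical toy data with the same column roles as df1
toy <- data.table(
  id.x = 1:3,
  s_val = c("F400", "F401", "F400"),
  query_string = c("000100", "000200", "000300")
)

# One row per id.x and one column per distinct s_val;
# cells with no matching query_string are filled with NA
wide <- dcast(toy, id.x ~ s_val, value.var = "query_string")
print(wide)
```

Each id.x here appears only once, so no aggregation happens and every cell holds either the original query_string or NA, which is the same behaviour the benchmark above relies on.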
Data
n <- 1e5
set.seed(24)
df1 <- data.frame(query_string = sample(sprintf("%06d", 100:1000),
n, replace=TRUE), id.x = sample(1:n),
s_val = sample(paste0("F", 400:700), n,
replace=TRUE), id.y = sample(100:3000, n, replace=TRUE),
ID_col_n = sample(100:1e6, n, replace=TRUE), total_id = 1:n)
Source: https://stackoverflow.com/questions/41079280/how-can-i-speed-a-function-in-tidyr-up