问题
I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.
I can quickly get which column has the maximum value for each row using
dt <- dt[, maxCol := which.max(.SD), by=pmxid]
but trying to get the actual maximum value using
dt <- dt[, nmax := max(.SD), by=pmxid]
is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.
How come finding the maximum takes so long? Shouldn't it take the same time as which.max()
or less?
回答1:
Though, you are seeking a data.table
solution, here is a base R
solution which would be fast enough for your dataset.
indx <- max.col(df, ties.method='first')
df[cbind(1:nrow(df), indx)]
On a slightly bigger dataset, system.time
comparisons revealed
system.time({
indx <- max.col(df1, ties.method='first')
res <- df1[cbind(1:nrow(df1), indx)]
})
# user system elapsed
# 2.180 0.163 2.345
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
# user system elapsed
#1265.792 2.305 1267.836
base R
method to be faster than the data.table
method in the post.
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10,
replace=TRUE), ncol=10))
#if there are NAs, change it to lowest number
df[is.na(df)] <- -999
set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
来源:https://stackoverflow.com/questions/28486654/row-maximum-in-data-table