Row maximum in data table

Question

I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.

I can quickly get which column has the maximum value for each row using

dt <- dt[, maxCol := which.max(.SD), by=pmxid]

but trying to get the actual maximum value using

dt <- dt[, nmax := max(.SD), by=pmxid]

is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.

How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?

Answer 1:

Although you are seeking a data.table solution, here is a base R approach that should be fast enough for your dataset.

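# column position of the first maximum in each row
indx <- max.col(df, ties.method='first')
# matrix indexing with (row, column) pairs returns one value per row
df[cbind(1:nrow(df), indx)]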
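
To make the two steps above concrete, here is a minimal sketch on a tiny data frame. The name `toy` and its values are made up for illustration and are not part of the original data.

```r
# Toy data frame of counts (hypothetical values, not the asker's data)
toy <- data.frame(a = c(1, 7, 3), b = c(5, 2, 3), c = c(4, 9, 0))

# Column position of the first maximum in each row
indx <- max.col(toy, ties.method = "first")

# A two-column matrix of (row, column) pairs selects one cell per row
row_max <- toy[cbind(seq_len(nrow(toy)), indx)]

indx     # 2 3 1
row_max  # 5 9 3
```

The third row has a tie between `a` and `b`; with `ties.method = "first"` the leftmost column wins.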

On a slightly bigger dataset, system.time comparisons gave the following timings:

# base R approach
system.time({
 indx <- max.col(df1, ties.method='first')
 res <- df1[cbind(1:nrow(df1), indx)]
})
#   user  system elapsed 
# 2.180   0.163   2.345 

# data.table approach from the question, on the same data
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
#      user   system  elapsed 
#1265.792    2.305 1267.836

These timings show the base R method (about 2.3 seconds elapsed) to be far faster than the data.table method in the post (about 1268 seconds elapsed).
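
If you would rather stay inside data.table, one possible sketch is to skip the per-row grouping and compute both results in a vectorized way across columns. This is not what was timed above, so treat it as a suggestion rather than a benchmarked result. It assumes `pmxid` is the only non-count column; `toy` and `cols` are names invented for this example.

```r
library(data.table)

# Toy table (hypothetical values); pmxid is the row id as in the question
toy <- data.table(pmxid = 1:3, a = c(1, 7, 3), b = c(5, 2, 3), c = c(4, 9, 0))

# Assumption: every column except pmxid holds a count
cols <- setdiff(names(toy), "pmxid")

# pmax() takes the element-wise maximum across the columns it is given,
# so one call covers all rows and no by= grouping is needed
toy[, nmax := do.call(pmax, .SD), .SDcols = cols]

# max.col() gives the position of the maximum within cols for each row
toy[, maxCol := max.col(.SD, ties.method = "first"), .SDcols = cols]
```

Both calls operate on whole columns at once, so they avoid evaluating an expression once per `pmxid` group.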

data

set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10, 
       replace=TRUE), ncol=10))
# if there are NAs, replace them with a value lower than any real count
df[is.na(df)] <- -999

set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
 replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
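
As a side note on the NA handling above: if only the row maximum is needed (not the column position), a sketch that leaves the NAs in place is to pass `na.rm = TRUE` to `pmax`. The data frame `toy` below is hypothetical.

```r
# Toy data frame with missing counts (hypothetical values)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 3), c = c(4, 9, NA))

# c(toy, na.rm = TRUE) builds the argument list: one vector per column
# plus the na.rm flag, so NAs are ignored when taking the maximum
row_max <- do.call(pmax, c(toy, na.rm = TRUE))

row_max  # 4 9 3
```

A row that is entirely NA would still come back as NA, so the sentinel replacement shown above remains the simpler choice when such rows must yield a number.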

Source: https://stackoverflow.com/questions/28486654/row-maximum-in-data-table

Tags

data.table