Row maximum in data table

感情迁移 posted on 2020-01-22 14:37:23

Question


I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.

I can quickly get which column has the maximum value for each row using

dt <- dt[, maxCol := which.max(.SD), by=pmxid]

but trying to get the actual maximum value using

dt <- dt[, nmax := max(.SD), by=pmxid]

is incredibly slow. I ran it for nearly 20 minutes, and only 200,000 row maximums had been calculated. Finding the max column took about 2 minutes for all 8,000,000 rows.

How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?


Answer 1:


Although you are looking for a data.table solution, here is a base R solution that should be fast enough for your dataset.

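# column index of the first maximum in each row
indx <- max.col(df, ties.method='first')
# pick the value at (row, indx) for every row via matrix indexing
df[cbind(1:nrow(df), indx)]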
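
To make the two steps concrete, here is a minimal sketch on a tiny made-up data frame (the values and the name `toy` are illustrative assumptions, not part of the original post). It also cross-checks the result against `do.call(pmax, ...)`, a different base R technique that computes an element-wise maximum across columns; it is included only as a sanity check.

```r
# Tiny illustrative data frame (made-up values, not the poster's data)
toy <- data.frame(a = c(3, 9, 1), b = c(7, 2, 1), c = c(5, 9, 0))

# Step 1: column index of the first maximum in each row
indx <- max.col(toy, ties.method = "first")
indx
# 2 1 1   (row 2 has a tie between columns 1 and 3; "first" picks column 1)

# Step 2: matrix indexing with (row, column) pairs extracts one value per row
nmax_idx <- toy[cbind(seq_len(nrow(toy)), indx)]
nmax_idx
# 7 9 1

# Cross-check with a parallel maximum over the columns
nmax_pmax <- do.call(pmax, unname(as.list(toy)))
stopifnot(all(nmax_idx == nmax_pmax))
```

Both approaches operate on whole columns at once rather than looping over rows, which is presumably why they scale to millions of rows.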

On a slightly bigger dataset, system.time comparisons revealed

system.time({
 indx <- max.col(df1, ties.method='first')
 res <- df1[cbind(1:nrow(df1), indx)]
})
#   user  system elapsed 
# 2.180   0.163   2.345 



df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
#      user   system  elapsed 
#1265.792    2.305 1267.836 

The base R method is far faster than the data.table method in the post (about 2.3 s versus about 1268 s elapsed).
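
If staying inside data.table is preferred, one possible sketch (not taken from the answer) is to drop the `by=pmxid` grouping and compute both columns with vectorized calls over `.SD`. The toy table and the column names `c1`..`c3` below are assumptions for illustration; only `pmxid`, `nmax` and `maxCol` come from the question. A likely reason the grouped version is slow is that `max()` gets called once per row, so avoiding the per-row call should help, though this has not been timed here.

```r
library(data.table)

# Toy table; c1..c3 stand in for the 100 count columns (illustrative names)
dt <- data.table(pmxid = 1:3,
                 c1 = c(3, 9, 1),
                 c2 = c(7, 2, 1),
                 c3 = c(5, 9, 0))

# Every column except the id is treated as a count column
cols <- setdiff(names(dt), "pmxid")

# Row maximum: pmax() over all count columns in a single call, no by=
dt[, nmax := do.call(pmax, .SD), .SDcols = cols]

# Position (within cols) of the first maximum in each row
dt[, maxCol := max.col(as.matrix(.SD), ties.method = "first"), .SDcols = cols]
```

Note that `maxCol` is a position within `cols`, so `cols[dt$maxCol]` gives the column name if that is more convenient.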

data

set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10, 
       replace=TRUE), ncol=10))
# if there are NAs, replace them with a value lower than any real count
df[is.na(df)] <- -999

set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
 replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999


Source: https://stackoverflow.com/questions/28486654/row-maximum-in-data-table
