R Using paste in data.table to subset variable number of columns and calculate rowMeans

问题

I have started using data.table over the past week and am facing an issue. I have already looked at the solution here and here but I am not entirely sure how it helps in my situation.

Here, is my sample data.

> dput(dt)
structure(list(link = c(1L, 1L, 1L, 1L, 1L, 1L), id = c(8395, 8738, 9788, 9789, 9908, 9920), person = c(2937837, 3092435, 3511555, 3511555, 3568112, 3575082), seqid = c(11, 14, 9, 1, 7, 10), time = c(NA, NA, 25372, 50700, NA, NA), max = c(14, 31, 9, 7, 8, 11), hr = c(NA, NA, 7, 14, NA, NA), minhr = c(11, 19, 7, 14, 7, 16), maxhr = c(11, 19, 7, 14, 7, 16), TRAVELTIME0.1avg = c(59, 59, 59, 59, 59, 59 ), TRAVELTIME1.2avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME2.3avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME3.4avg = c(59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819 ), TRAVELTIME4.5avg = c(59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214), TRAVELTIME5.6avg = c(60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798), TRAVELTIME6.7avg = c(59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742 ), TRAVELTIME7.8avg = c(59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874), TRAVELTIME8.9avg = c(59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023), TRAVELTIME9.10avg = c(59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216 ), TRAVELTIME10.11avg = c(59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537 ), TRAVELTIME11.12avg = c(59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983 ), TRAVELTIME12.13avg = c(59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587 ), TRAVELTIME13.14avg = c(59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185), TRAVELTIME14.15avg = c(59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924), TRAVELTIME15.16avg = c(59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373 ), TRAVELTIME16.17avg = c(59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671 ), TRAVELTIME17.18avg = c(59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857 ), TRAVELTIME18.19avg = c(59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712 ), TRAVELTIME19.20avg = c(59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269 ), TRAVELTIME20.21avg = c(59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925 ), TRAVELTIME21.22avg = c(59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727 ), TRAVELTIME22.23avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME23.24avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME24.25avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME25.26avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME26.27avg = c(59, 59, 59, 59, 59, 59)), .Names = c("link", "id", "person", "seqid", "time", "max", "hr", "minhr", "maxhr", "TRAVELTIME0.1avg", "TRAVELTIME1.2avg", "TRAVELTIME2.3avg", "TRAVELTIME3.4avg", "TRAVELTIME4.5avg", "TRAVELTIME5.6avg", "TRAVELTIME6.7avg", "TRAVELTIME7.8avg", "TRAVELTIME8.9avg", "TRAVELTIME9.10avg", "TRAVELTIME10.11avg", "TRAVELTIME11.12avg", "TRAVELTIME12.13avg", "TRAVELTIME13.14avg", "TRAVELTIME14.15avg", "TRAVELTIME15.16avg", "TRAVELTIME16.17avg", "TRAVELTIME17.18avg", "TRAVELTIME18.19avg", "TRAVELTIME19.20avg", "TRAVELTIME20.21avg", "TRAVELTIME21.22avg", "TRAVELTIME22.23avg", "TRAVELTIME23.24avg", "TRAVELTIME24.25avg", "TRAVELTIME25.26avg", "TRAVELTIME26.27avg"), sorted = "link", class = c("data.table", "data.frame"), row.names = c(NA, -6L))

Update1: To avoid the issue of internal.selfref do dt <- data.table(dt) after you create dt using the above sample.

I want to use the minhr and maxhr variables to subset the travel times and calculate the rowMeans for those subsetted travel times and add it to the current dt. If minhr (or maxhr) is 11, the corresponding travel time column is TRAVELTIME11.12avg; if it is 19, the corresponding travel time column is TRAVELTIME19.20avg. So, if minhr is 9 and maxhr is 10 for a row, then I need to get the mean of TRAVELTIME9.10avg and TRAVELTIME10.11avg; similarly, if minhr is 15 and maxhr is 17 then I need to get the mean of TRAVELTIME15.16avg, TRAVELTIME16.17avg, and TRAVELTIME17.18avg.

I tried to approach the problem step-wise and used the following code for a simple case of uniform travel time columns across all the rows. It works fine.

> dt[,avg:=rowMeans(.SD[,TRAVELTIME10.11avg:TRAVELTIME12.13avg, with=FALSE]),by=.(id, seqid)]

Next, I tried to modify the above code by introducing paste0() to refer to the column names dynamically. But, this results in an error. Additionally, I have tried to use as.symbol(paste0()), noquote(paste0()) and a couple of other unquoting techniques without any success.

   > dt[,avg:=rowMeans(.SD[,paste0("TRAVELTIME", minhr, "." , minhr+1, "avg"):paste0("TRAVELTIME", maxhr, "." , maxhr+1, "avg"), with=FALSE]),by=.(id, seqid)]

Error in paste0("TRAVELTIME", minhr, ".", minhr + 1, "avg"):paste0("TRAVELTIME",  : 
  NA/NaN argument
In addition: Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In eval(expr, envir, enclos) : NAs introduced by coercion

Given this, I have two questions:

1) Why doesn't data.table recognize the column names if the paste command is used (even after unquoting the pasted strings) to subset columns as opposed to directly using the column names? Does it have anything to do with the unequal number of columns for every row?

2) Since I am unsuccessful, can you please suggest a way to find the mean over variable number of columns for every row, and add it back to the dt. I would appreciate if the suggestion leads to an efficient way, because, I already tried this using a simpler looping method and it takes a long time (approximately 12 to 15 hours for my entire dataset) for the size of my data.

回答1:

I believe this solves the problem you were having with paste0:

tmp  <- paste0("TRAVELTIME", dt$minhr, "." , dt$minhr+1, "avg")
tmp1 <- paste0("TRAVELTIME", dt$maxhr, "." , dt$maxhr+1, "avg")
dt1  <- dt[,avg:=rowMeans(.SD[,get(tmp):get(tmp1), with=FALSE]),by=.(dt$id, dt$seqid)]

Someone will probably point out that you don't strictly need the $ in the last line, but due to the nature of the problem you were having I felt this was useful for identifying and solving the problem.

来源：https://stackoverflow.com/questions/38470752/r-using-paste-in-data-table-to-subset-variable-number-of-columns-and-calculate-r

标签

data.table

paste