A replacement for `subset()` for a list of data.frames

Posted by 眉间皱痕 on 2019-12-08 04:35:19

Question


Function foo1 subsets a list of data.frames (using subset()) by one or more requested variables (e.g., by = ESL == 1 or by = ESL == 1 & type == 4).

However, I'm aware of the dangers of using subset() in R. What can I use instead of subset() in foo1 below to get the same output?

foo1 <- function(data, by){

  s <- substitute(by)
  L <- split(data, data$study.name) ; L[[1]] <- NULL

  lapply(L, function(x) do.call("subset", list(x, s))) ## What to use instead of `subset`
                                                       ## to get the same output?
}

# EXAMPLE OF USE:
D <- read.csv("https://raw.githubusercontent.com/izeh/i/master/k.csv", header=TRUE) # DATA
foo1(D, ESL == 1) 

Answer 1:


You can compute on the language. Building on my answer to "Working with substitute after $ sign in R":

foo1 <- function(data, by){

  s <- substitute(by)
  L <- split(data, data$study.name) ; L[[1]] <- NULL

  E <- quote(x$a)
  E[[3]] <- s[[2]]
  s[[2]] <- E

  eval(bquote(lapply(L, function(x) x[.(s),])))
}

foo1(D, ESL == 1) 

This gets more complex for arbitrary subset expressions. You'd need a recursive function that crawls the parse tree and inserts the calls to $ at the right places.
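
A minimal sketch of such a recursive rewrite is shown below. The names prefix_cols, foo2 and toy are hypothetical and not part of the original answer; the toy data frame stands in for the CSV from the question, and the question's L[[1]] <- NULL step is left out because the toy data has no group to discard. The sketch assumes every bare symbol that matches a column name is meant as that column.

```r
# Hypothetical helper (not part of the original answer): walk the expression
# tree and rewrite every bare symbol that names a column into x$<column>.
prefix_cols <- function(expr, cols, df_sym = quote(x)) {
  if (is.name(expr)) {
    if (as.character(expr) %in% cols) call("$", df_sym, expr) else expr
  } else if (is.call(expr)) {
    # Keep the function position (`==`, `&`, ...) and recurse into the arguments.
    args <- lapply(as.list(expr)[-1], prefix_cols, cols = cols, df_sym = df_sym)
    as.call(c(list(expr[[1]]), args))
  } else {
    expr  # constants such as 1 or "a" pass through unchanged
  }
}

foo2 <- function(data, by) {
  s <- prefix_cols(substitute(by), names(data))
  L <- split(data, data$study.name)
  eval(bquote(lapply(L, function(x) x[.(s), ])))
}

# Toy data standing in for the CSV used in the question (an assumption).
toy <- data.frame(study.name = c("A", "A", "B", "B"),
                  ESL = c(1, 0, 1, 1), type = c(4, 4, 2, 4))
foo2(toy, ESL == 1 & type == 4)
```

Two caveats apply. A variable in the calling environment that shares its name with a column is rewritten to the column, and unlike subset(), indexing with `[` returns NA rows when the condition evaluates to NA.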

Personally, I'd just use the data.table package, where this is easier because you don't need $; i.e., you can just do eval(bquote(lapply(L, function(x) setDT(x)[.(s),]))) without changing s. On the other hand, I wouldn't do this at all: there is really no reason to split before subsetting.
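
To illustrate the last point, here is a rough sketch that filters the whole data frame once and only then splits it. The names filter_then_split and toy are hypothetical. Evaluating the condition with eval(s, data, parent.frame()) is still non-standard evaluation, so it shares the ambiguities of subset(). Groups left with no rows may be handled differently from the original foo1, depending on whether study.name is a factor.

```r
# Hypothetical function: subset first, split afterwards.
filter_then_split <- function(data, by) {
  s <- substitute(by)
  # Evaluate the condition with the columns of `data` in scope,
  # falling back to the caller's environment for other names.
  keep <- eval(s, data, parent.frame())
  keep <- keep & !is.na(keep)          # drop rows where the condition is NA
  kept <- data[keep, , drop = FALSE]
  split(kept, kept$study.name)
}

# Toy data standing in for the CSV used in the question (an assumption).
toy <- data.frame(study.name = c("A", "A", "B", "B"),
                  ESL = c(1, 0, 1, NA),
                  stringsAsFactors = FALSE)
filter_then_split(toy, ESL == 1)
```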




Answer 2:


I would guess (based on general knowledge and a quick skim of the answers to the "dangers of subset()" question) that the dangers of subset are intrinsic dangers of non-standard evaluation (NSE); if you want to be able to pass a generic expression and have it evaluated within the context of a data frame, I think you're more or less stuck with subset() or something like it.

If you were willing to use a more constrained set of expressions such as var, vals (looking for cases where the variable indexed by string var took on values in the vector vals) you could use

d[d[[var]] %in% vals, ]

Here var is a string, not a naked R symbol ("cyl" rather than cyl); it's unambiguous that you want to extract it from the data frame.

You could extend this to a vector of variables and a list of vectors of values:

for (i in seq_along(vars)) {
   d <- d[d[[vars[i]]] %in% vals[[i]], ]
}
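
Wrapped in a function and applied to a list of data frames, the loop above might look like the sketch below. The names filter_by_values and toy are made up for illustration. Conditions on different variables are combined with AND only.

```r
# Hypothetical helper: keep rows where every variable in `vars` takes one of
# the allowed values in the matching element of `vals`.
filter_by_values <- function(d, vars, vals) {
  stopifnot(length(vars) == length(vals))
  for (i in seq_along(vars)) {
    d <- d[d[[vars[i]]] %in% vals[[i]], , drop = FALSE]
  }
  d
}

# Toy data standing in for the CSV used in the question (an assumption).
toy <- data.frame(study.name = c("A", "A", "B", "B"),
                  ESL = c(1, 0, 1, 1), type = c(4, 4, 2, 4))
L <- split(toy, toy$study.name)
lapply(L, filter_by_values, vars = c("ESL", "type"), vals = list(1, c(2, 4)))
```

Because %in% returns FALSE rather than NA for missing values, rows with NA in a filtered column are dropped instead of turning into NA rows.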

but if you want the full flexibility of expressions (e.g. to be able to use either ESL == 1 & type == 4 or ESL == 1 | type == 4, or inequalities based on numeric variables) I think you're stuck with an NSE-based approach.

It's conceivable that the new-ish "tidy eval" machinery (in the rlang package, whose documentation covers it in some detail) would give you a slightly more principled approach, but I don't think the dangers will completely go away.
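
As a rough sketch of what a tidy-eval version could look like, assuming the rlang package is installed: enquo() captures the expression together with its environment, and eval_tidy() evaluates it with the data frame's columns in scope. The names foo_tidy and toy are hypothetical.

```r
library(rlang)

# Hypothetical function: tidy-eval variant of foo1.
foo_tidy <- function(data, by) {
  q <- enquo(by)                        # capture expression + calling environment
  L <- split(data, data$study.name)
  lapply(L, function(x) {
    keep <- eval_tidy(q, data = x)      # columns of x mask other variables
    x[keep & !is.na(keep), , drop = FALSE]
  })
}

# Toy data standing in for the CSV used in the question (an assumption).
toy <- data.frame(study.name = c("A", "A", "B", "B"),
                  ESL = c(1, 0, 1, 1), type = c(4, 4, 2, 4))
foo_tidy(toy, ESL == 1 & type == 4)
```

A caller who wants to be explicit that a name refers to a column can write foo_tidy(toy, .data$ESL == 1), using the .data pronoun that eval_tidy() provides.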



Source: https://stackoverflow.com/questions/58477309/a-replacement-for-subset-for-a-list-of-data-frames
