A replacement for `subset()` for a list of data.frames

Question

Function foo1 can subset (using subset()) a list of data.frames by one or more requested variables (e.g., by = ESL == 1 or by = ESL == 1 & type == 4).

However, I'm aware of the dangers of using subset() in R. What can I use in foo1 below instead of subset() to get the same output?

foo1 <- function(data, by){

  s <- substitute(by)
  L <- split(data, data$study.name) ; L[[1]] <- NULL

  lapply(L, function(x) do.call("subset", list(x, s))) ## What to use instead of `subset`
                                                       ## to get the same output?
}

# EXAMPLE OF USE:
D <- read.csv("https://raw.githubusercontent.com/izeh/i/master/k.csv", header=TRUE) # DATA
foo1(D, ESL == 1)

Answer 1:

You can compute on the language. Building on my answer to "Working with substitute after $ sign in R":

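foo1 <- function(data, by){

  s <- substitute(by)                    # capture the condition unevaluated, e.g. ESL == 1
  L <- split(data, data$study.name) ; L[[1]] <- NULL

  E <- quote(x$a)                        # template call: x$<column>
  E[[3]] <- s[[2]]                       # put the left-hand side of `by` (ESL) after the $
  s[[2]] <- E                            # the condition is now x$ESL == 1

  # splice the rewritten condition into the anonymous function, then evaluate
  eval(bquote(lapply(L, function(x) x[.(s),])))
}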

foo1(D, ESL == 1)

This gets more complex for arbitrary subset expressions. You'd need a recursive function that crawls the parse tree and inserts the calls to $ at the right places.
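A minimal sketch of such a recursive walker, assuming every symbol that matches a column name should become x$<column>. The helper name qualify_columns and its arguments are hypothetical, and the sketch does not handle calls with empty arguments (such as z[, 1]) or local variables that share a name with a column:

```r
# Hypothetical helper: walk an expression and prefix column symbols with `x$`.
qualify_columns <- function(expr, cols, data_sym = quote(x)) {
  if (is.symbol(expr)) {
    # A bare symbol: qualify it only if it names a column.
    if (as.character(expr) %in% cols) call("$", data_sym, expr) else expr
  } else if (is.call(expr)) {
    # Leave existing `$` calls alone so x$ESL is not rewritten twice.
    if (identical(expr[[1]], quote(`$`))) return(expr)
    # Keep the function position (`==`, `&`, ...) and recurse into the arguments.
    args <- lapply(as.list(expr)[-1], qualify_columns,
                   cols = cols, data_sym = data_sym)
    as.call(c(list(expr[[1]]), args))
  } else {
    expr  # constants such as 1 or "a" pass through unchanged
  }
}

# Toy data, made up for illustration
d <- data.frame(ESL = c(1, 1, 2), type = c(4, 3, 4))
s <- qualify_columns(quote(ESL == 1 & type == 4), names(d))
L <- list(a = d, b = d[3, ])
res <- eval(bquote(lapply(L, function(x) x[.(s), ])))
```

Here s ends up as the call x$ESL == 1 & x$type == 4, which is then spliced into the anonymous function exactly as in foo1 above.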

Personally, I'd just use the data.table package, where this is easier because you don't need $: you can simply do eval(bquote(lapply(L, function(x) setDT(x)[.(s),]))) without changing s. On the other hand, I wouldn't do this at all. There is really no reason to split before subsetting.
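To illustrate the last point, here is a rough base-R sketch that filters once and splits afterwards. The name foo2 and the toy data are made up for illustration. It still evaluates the condition non-standardly, much as subset() does internally, and its output can differ from foo1: groups with no matching rows may disappear rather than come back as empty data frames, and the first group is not dropped.

```r
# Hypothetical variant: subset the whole data frame first, then split.
foo2 <- function(data, by) {
  s <- substitute(by)
  keep <- eval(s, data, parent.frame())  # one logical value per row
  keep <- keep & !is.na(keep)            # treat NA as FALSE, as subset() does
  kept <- data[keep, , drop = FALSE]
  split(kept, kept$study.name)
}

# Toy data, not the CSV from the question
d <- data.frame(study.name = c("A", "A", "B", "C"),
                ESL = c(1, 2, 1, 2),
                type = c(4, 4, 3, 4),
                stringsAsFactors = FALSE)
res <- foo2(d, ESL == 1)
```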

Answer 2:

I would guess (based on general knowledge and a quick skim of the answers to the "dangers of subset()" question) that the dangers of subset are intrinsic dangers of non-standard evaluation (NSE); if you want to be able to pass a generic expression and have it evaluated within the context of a data frame, I think you're more or less stuck with subset() or something like it.

If you were willing to use a more constrained set of expressions such as var, vals (looking for cases where the variable indexed by string var took on values in the vector vals) you could use

d[d[[var]] %in% vals, ]

Here var is a string, not a naked R symbol ("cyl" rather than cyl); it's unambiguous that you want to extract it from the data frame.

You could extend this to a vector of variables and a list of vectors of values:

for (i in seq_along(vars)) {
   d <- d[d[[vars[i]]] %in% vals[[i]], ]
}
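Wrapped up for a list of data frames, the loop above might look like the following sketch. The helper name filter_all and the use of a named list for the conditions are assumptions made for illustration, and mtcars stands in for the real data:

```r
# Hypothetical helper: keep rows where every named variable takes one of the
# allowed values. `conds` is a named list, e.g. list(cyl = c(4, 6), gear = 4).
filter_all <- function(d, conds) {
  for (v in names(conds)) {
    d <- d[d[[v]] %in% conds[[v]], , drop = FALSE]
  }
  d
}

# Apply the same conditions to each element of a list of data frames.
L <- split(mtcars, mtcars$am)
res <- lapply(L, filter_all, conds = list(cyl = c(4, 6), gear = 4))
```

Because the variables are passed as strings, nothing is looked up outside the data frame, but the conditions are limited to "variable is one of these values" combined with AND.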

but if you want the full flexibility of expressions (e.g. to be able to use either ESL == 1 & type == 4 or ESL == 1 | type == 4, or inequalities based on numeric variables) I think you're stuck with an NSE-based approach.

It's conceivable that the new-ish "tidy eval" machinery (in the rlang package, documented in some detail here) would give you a slightly more principled approach, but I don't think the dangers will completely go away.
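As a rough illustration of what that could look like, the sketch below uses rlang's enquo() and eval_tidy(). The function name foo_tidy is hypothetical, mtcars stands in for the real data, and this is only one possible way to use the package:

```r
library(rlang)

# Hypothetical tidy-eval variant: capture `by` as a quosure, then evaluate it
# against each data frame in the list.
foo_tidy <- function(L, by) {
  q <- enquo(by)
  lapply(L, function(x) {
    keep <- eval_tidy(q, data = x)           # columns of x take precedence
    x[keep & !is.na(keep), , drop = FALSE]   # drop NA matches, as subset() does
  })
}

L <- split(mtcars, mtcars$gear)
res <- foo_tidy(L, cyl == 4 | hp > 200)
```

Inside the condition, the .data pronoun (for example .data$cyl == 4) can be used to say explicitly that a name refers to a column, which addresses part of the ambiguity, though not all of it.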

Source: https://stackoverflow.com/questions/58477309/a-replacement-for-subset-for-a-list-of-data-frames

Tags: list, function, dataframe, subset