Group-wise conditional subsetting where feasible

Question

I would like to subset rows of my data

library(data.table); set.seed(333); n <- 100 
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))

> head(dat)
   id group        x        y        z
1:  1     1 109.3400 208.6732 308.7595
2:  2     1 101.6920 201.0989 310.1080
3:  3     1 119.4697 217.8550 313.9384
4:  4     1 111.4261 205.2945 317.3651
5:  5     1 100.4024 212.2826 305.1375
6:  6     1 114.4711 203.6988 319.4913

in several stages, unless a stage results in an empty subset, in which case I would like to skip that specific stage. In an earlier question, Frank found a great solution for this:

f = function(x, ..., verbose=FALSE){
  L   = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }    
  }
  print(mon)
  return(x)
}

where I can enter the data, and the cut-offs for each variable manually.

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id  group         x        y        z
1: 55      2  119.2634 219.0044 315.6556

I now wonder how this (or something even better!) could be applied to a case where the cut-offs are in a second data.table

c <- data.table(group=1:2, x=c(110,119), y=c(210,219), z=c(310,319))
> c
   group   x   y   z
1:     1 110 210 310
2:     2 119 219 319

and specified for each group separately.

If I were to use f(), I thought of joining c into dat, but I can't figure out how. Perhaps there is a smarter way entirely.

Answer 1:

First, I would change how c is constructed. You currently have it set up with one column per filter, but a long format would allow you to use multiple filters on the same column similar to your initial example (i.e. two filters on y):

c <- data.table(group=c(1,2,1,2,1,2,1),variable = c("x","x","y","y","z","z","y"), c_val = c(110,119,210,219,310,319,1e6))
c[, c_id := 1:.N]
c
   group variable   c_val c_id
1:     1        x     110    1
2:     2        x     119    2
3:     1        y     210    3
4:     2        y     219    4
5:     1        z     310    5
6:     2        z     319    6
7:     1        y 1000000    7
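If the cut-offs already exist in the wide layout from the question, they do not have to be retyped. A minimal sketch, assuming the wide table is stored under the hypothetical name `cuts_wide` (chosen here only to avoid masking base `c`), reshapes it with `melt`:

```r
library(data.table)

# Wide cut-off table as in the question; the name `cuts_wide` is an assumption
cuts_wide <- data.table(group = 1:2, x = c(110, 119), y = c(210, 219), z = c(310, 319))

# One row per (group, variable) pair; keep `variable` as character, not factor
cuts_long <- melt(cuts_wide, id.vars = "group",
                  variable.name = "variable", value.name = "c_val",
                  variable.factor = FALSE)

# Give every filter its own id so it can be counted later
cuts_long[, c_id := seq_len(.N)]
cuts_long
```

This reproduces the first six filters of the long table above; an extra filter such as the second one on y would still have to be appended as an additional row.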

You can then merge your filters with your data:

dat_cut <- melt(dat, id.vars = c("id", "group"), value.name = "dat_val")
output <- merge(dat_cut, c, by = c("group","variable"), allow.cartesian = TRUE)

The next line applies the filters. You can extend it if you want richer filter logic (greater than, less than, equal, etc.) by encoding that logic in c:

output <- output[dat_val > c_val]
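One possible way to encode the comparison in the filter table is an extra operator column. The sketch below uses toy data; the column `op` and the names `vals`, `flt`, `m` and `both` are assumptions made for illustration and are not part of the original answer.

```r
library(data.table)

# Toy long-format data and filters; `op` is a hypothetical operator column
vals <- data.table(id = 1:4, group = 1L, variable = "x", dat_val = c(5, 10, 15, 20))
flt  <- data.table(group = 1L, variable = "x",
                   c_val = c(8, 18), op = c(">", "<"), c_id = 1:2)

# Pair every value with every filter for the same group and variable
m <- merge(vals, flt, by = c("group", "variable"), allow.cartesian = TRUE)

# Evaluate the comparison that each filter row asks for
m[, pass := (op == ">"  & dat_val >  c_val) |
            (op == "<"  & dat_val <  c_val) |
            (op == "==" & dat_val == c_val)]

# ids that pass both filters (x > 8 and x < 18)
both <- m[pass == TRUE, .N, by = id][N == 2, id]
both
```

Filtering on `pass == TRUE` then takes the place of the plain `dat_val > c_val` test.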

Next, find every id for which the number of filters it passes equals the number of distinct filters that at least one row in its group passes:

output[,req_match := uniqueN(c_id), by = .(group)] # number of distinct filters met by at least one row in the group
selection <- output[,.N,by = .(id, group, req_match)][N == req_match, id]

A filter that matched no rows in its group was already dropped by the dat_val > c_val step, so it is not counted in req_match and is effectively skipped.

Then you can filter your initial dataset for the solution:

dat[id %in% selection]

   id group        x        y        z
1:  3     1 119.4697 217.8550 313.9384
2: 18     1 117.2930 216.5670 310.4617
3: 35     1 110.4283 218.6130 312.0904
4: 50     1 119.2519 214.2517 318.8567
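The steps above can be collected into one function. This is only a sketch: the name `subset_feasible` and the toy tables are hypothetical, and it assumes that the data has `id` and `group` columns and that the cut-offs are in long format with `group`, `variable` and `c_val`.

```r
library(data.table)

# Hypothetical helper wrapping the melt / merge / count steps of the answer
subset_feasible <- function(dat, cuts) {
  cuts <- copy(cuts)[, c_id := seq_len(.N)]            # one id per filter
  long <- melt(dat, id.vars = c("id", "group"),
               value.name = "dat_val", variable.factor = FALSE)
  hits <- merge(long, cuts, by = c("group", "variable"), allow.cartesian = TRUE)
  hits <- hits[dat_val > c_val]                         # keep satisfied filters only
  hits[, req_match := uniqueN(c_id), by = group]        # feasible filters per group
  keep <- hits[, .N, by = .(id, group, req_match)][N == req_match, id]
  dat[id %in% keep]
}

# Toy data: in group 2 the filter x > 100 matches nothing and is skipped
toy  <- data.table(id = 1:6, group = rep(1:2, each = 3),
                   x = c(1, 5, 9, 2, 4, 6), y = c(10, 20, 30, 15, 25, 35))
cuts <- data.table(group = c(1L, 1L, 2L, 2L),
                   variable = c("x", "y", "x", "y"),
                   c_val = c(4, 15, 100, 20))

res <- subset_feasible(toy, cuts)
res
```

In the toy data, ids 2 and 3 pass both filters of group 1, while ids 5 and 6 pass the only feasible filter of group 2. Like the steps above, this evaluates all filters at once rather than stage by stage, so it is not guaranteed to return the same rows as f() in every case.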

Source: https://stackoverflow.com/questions/57465187/group-wise-conditional-subsetting-where-feasible

Tags

data.table

subset