Subsetting a data frame based on key spanning several columns in another (summary) data frame

I have a data frame a with 4 identifying columns: A, B, C, D. A second data frame b, created with ddply(), contains a summary of all the values for different Ds for every set of A,B,C. A third data frame c contains a subset of b with bad values that I want to delete from a.

Thus, I want a subset from a, omitting all the rows identified by a combination of A,B,C that are also present in c. I can think of ways do this (ugly and inefficiently) in a loop, but, my DBA background encourages me to seek a solution that is a little bit more … direct.

In code:

a <- data.frame(
  A=rep(c('2013-10-30', '2014-11-6'), each=16*20),
  B=rep(1:8, each=2*20),
  C=rep(1:4, each=20),
  D=1:20
)

a$Val=rnorm(nrow(a))

library(plyr)
b <- ddply(a, ~B+C+A, summarise,
           mean_Val=mean(Val))

# Some subset criteria based on AOI group values
c <- subset(b, mean_Val <= 0)

# EDIT: Delete all the rows from a for which the
# key-triplets A,B,C are present in c
for (i in 1:nrow(c)) {
  c_row = c[i,]
  a <- a[ which( !(a$A==c_row$A & a$B==c_row$B & a$C==c_row$C) ), ]
}
# This is the loopy type of 'solution' I didn't want to use

Please feel free also to address unclarities in my question. I'd be happy to edit if you can point me in the right direction.

If we already created 3 datasets and want to subset the first "a" based on the elements of "c/c1", one option is anti_join from dplyr

library(dplyr)
anti_join(a, c1, by=c('A', 'B', 'C'))

Update

Or we could use a base R option with interaction to paste the columns of interest together in both datasets and check whether the elements of 2nd ('c') are in 1st ('a') using %in%. The logical index can be used to subset "a".

 a1 <- a[!(as.character(interaction(a[1:3], sep=".")) %in% 
          as.character(interaction(c[LETTERS[1:3]], sep="."))),]

Or as @David Arenburg mentioned, we may not need to create b, or c datasets to get the expected output. Using plyr, create a new mean column ("mean_Val") in "a" with mutate and subset the rows with mean greater than 0 (mean_Val >0)

 library(plyr)
 subset(ddply(a, ~B+C+A, mutate, mean_Val=mean(Val)), mean_Val>0)

Or a similar approach using dplyr

 library(dplyr)
  a %>%
     group_by(B, C, A) %>%
     mutate(mean_Val=mean(Val)) %>% 
     filter(mean_Val>0)

Or if we don't need the "mean" values as a column in "a", ave from base R could be used as well.

  a[!!with(a, ave(Val, B, C, A, FUN=function(x) mean(x)>0)),]

If we need to keep the mean_Val column (a variation proposed by @David Arenburg)

  subset(transform(a, Mean_Val = ave(Val, B, C, A, FUN = mean)),
                 Mean_Val > 0)

data

set.seed(24)
a <- data.frame(A= sample(LETTERS[1:3], 20, replace=TRUE), 
   B=sample(LETTERS[1:3], 20, replace=TRUE), C=sample(LETTERS[1:3], 
         20, replace=TRUE), D=rnorm(20))

b <- a %>% 
       group_by(A, B, C) %>% 
       summarise(D=sum(D))
set.seed(39)
c1 <- b[sample(1:nrow(b), 6, replace=FALSE),]

Here's a possible data.table solution which won't require creating neither b or c

library(data.table) 
as.data.table(a)[, if(mean(Val) > 0) .SD, by = list(B, C, A)]

Or similarly (If you also want the mean itself)

as.data.table(a)[, Mean_Val := mean(Val), list(B, C, A)][Mean_Val > 0]

来源：https://stackoverflow.com/questions/28393823/subsetting-a-data-frame-based-on-key-spanning-several-columns-in-another-summar

标签

dataframe

subset