问题
I have a data frame a
with 4 identifying columns: A, B, C, D
. A second data frame b
, created with ddply()
, contains a summary of all the values for different D
s for every set of A,B,C
. A third data frame c
contains a subset of b
with bad values that I want to delete from a
.
Thus, I want a subset from a
, omitting all the rows identified by a combination of A,B,C
that are also present in c
. I can think of ways do this (ugly and inefficiently) in a loop, but, my DBA background encourages me to seek a solution that is a little bit more … direct.
In code:
a <- data.frame(
A=rep(c('2013-10-30', '2014-11-6'), each=16*20),
B=rep(1:8, each=2*20),
C=rep(1:4, each=20),
D=1:20
)
a$Val=rnorm(nrow(a))
library(plyr)
b <- ddply(a, ~B+C+A, summarise,
mean_Val=mean(Val))
# Some subset criteria based on AOI group values
c <- subset(b, mean_Val <= 0)
# EDIT: Delete all the rows from a for which the
# key-triplets A,B,C are present in c
for (i in 1:nrow(c)) {
c_row = c[i,]
a <- a[ which( !(a$A==c_row$A & a$B==c_row$B & a$C==c_row$C) ), ]
}
# This is the loopy type of 'solution' I didn't want to use
Please feel free also to address unclarities in my question. I'd be happy to edit if you can point me in the right direction.
回答1:
If we already created 3 datasets and want to subset the first "a" based on the elements of "c/c1", one option is anti_join
from dplyr
library(dplyr)
anti_join(a, c1, by=c('A', 'B', 'C'))
Update
Or we could use a base R
option with interaction
to paste the columns of interest together in both datasets and check whether the elements of 2nd ('c') are in 1st ('a') using %in%
. The logical index can be used to subset "a".
a1 <- a[!(as.character(interaction(a[1:3], sep=".")) %in%
as.character(interaction(c[LETTERS[1:3]], sep="."))),]
Or as @David Arenburg mentioned, we may not need to create b
, or c
datasets to get the expected output. Using plyr
, create a new mean column ("mean_Val") in "a" with mutate
and subset
the rows with mean greater than 0 (mean_Val >0
)
library(plyr)
subset(ddply(a, ~B+C+A, mutate, mean_Val=mean(Val)), mean_Val>0)
Or a similar approach using dplyr
library(dplyr)
a %>%
group_by(B, C, A) %>%
mutate(mean_Val=mean(Val)) %>%
filter(mean_Val>0)
Or if we don't need the "mean" values as a column in "a", ave
from base R
could be used as well.
a[!!with(a, ave(Val, B, C, A, FUN=function(x) mean(x)>0)),]
If we need to keep the mean_Val
column (a variation proposed by @David Arenburg)
subset(transform(a, Mean_Val = ave(Val, B, C, A, FUN = mean)),
Mean_Val > 0)
data
set.seed(24)
a <- data.frame(A= sample(LETTERS[1:3], 20, replace=TRUE),
B=sample(LETTERS[1:3], 20, replace=TRUE), C=sample(LETTERS[1:3],
20, replace=TRUE), D=rnorm(20))
b <- a %>%
group_by(A, B, C) %>%
summarise(D=sum(D))
set.seed(39)
c1 <- b[sample(1:nrow(b), 6, replace=FALSE),]
回答2:
Here's a possible data.table
solution which won't require creating neither b
or c
library(data.table)
as.data.table(a)[, if(mean(Val) > 0) .SD, by = list(B, C, A)]
Or similarly (If you also want the mean itself)
as.data.table(a)[, Mean_Val := mean(Val), list(B, C, A)][Mean_Val > 0]
来源:https://stackoverflow.com/questions/28393823/subsetting-a-data-frame-based-on-key-spanning-several-columns-in-another-summar