Conditional filter of grouped factors - dplyr

问题

Say I have this sort of dataframe:

   day value group type id
1    1   0.1     A    X  1
2    1   0.4     A    Y  1
3    2   0.2     A    X  3
4    2   0.5     A    Y  3
5    3   0.3     A    X  5
6    3   0.2     A    Y  6
7    1   0.1     B    X  3
8    1   0.3     B    Y  3
9    2   0.1     B    X 11
10   2   0.4     B    Y 10
11   3   0.2     B    X 12
12   3   0.3     B    Y 12
13   1   0.1     C    X 12
14   1   0.3     C    Y 12
15   2   0.3     C    X  5
16   2   0.2     C    Y  5
17   3   0.2     C    X  3
18   3   0.2     C    Y  2

Data:

library(dplyr)
df1 <- data.frame(
day = rep(1:3,6),
value = c(0.1,0.2,0.3,0.4,0.5,0.2,0.1,0.1,0.2,0.3,0.4,0.3, 0.1,0.3,0.2,0.3,0.2,0.2),
group = rep(LETTERS[1:3], each=6)
) %>% 
  arrange(group,day) %>% 
  mutate(type=rep(LETTERS[24:25],9),
         id = c(1,1,3,3,5,6,3,3,11,10,12,12,12,12,5,5,3,2))  

df1

I want to filter this dataframe based upon a conditional filter. I want to group_by(day, group) and if all id within each grouping are all equal, I want to filter out all rows of type Y but keep rows of type X.

I can do this by running a loop or through several steps of dataframe subsetting, but I was wondering if there is a one/two liner in dplyr or data.table that I am somehow overlooking.

This would be the desired output:

   day value group type id
1    1   0.1     A    X  1
3    2   0.2     A    X  3
5    3   0.3     A    X  5
6    3   0.2     A    Y  6
7    1   0.1     B    X  3
9    2   0.1     B    X 11
10   2   0.4     B    Y 10
11   3   0.2     B    X 12
13   1   0.1     C    X 12
15   2   0.3     C    X  5
17   3   0.2     C    X  3
18   3   0.2     C    Y  2

回答1:

Here is an one-liner with data.table.

We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'day', 'group', if the length of unique elements of 'id' is 1, we get the Subset of Data.table (.SD) rows where 'type' is 'X' or else get the .SD.

library(data.table)#v1.9.6+
setDT(df1)[, if(uniqueN(id)==1) .SD[type=='X'] else .SD, .(day, group)]
#    day group value type id
# 1:   1     A   0.1    X  1
# 2:   2     A   0.2    X  3
# 3:   3     A   0.3    X  5
# 4:   3     A   0.2    Y  6
# 5:   1     B   0.1    X  3
# 6:   2     B   0.1    X 11
# 7:   2     B   0.4    Y 10
# 8:   3     B   0.2    X 12
# 9:   1     C   0.1    X 12
#10:   2     C   0.3    X  5
#11:   3     C   0.2    X  3
#12:   3     C   0.2    Y  2

Or if the 'type' is already ordered as in the example data

unique(setDT(df1), by = c('day', 'group', 'id'))

If not ordered,

unique(setDT(df1)[order(group,day, id, type)],by = c('day', 'group' , 'id'))

data

df1 <- structure(list(day = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 
2L, 2L, 
3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), value = c(0.1, 0.4, 0.2, 0.5, 
0.3, 0.2, 0.1, 0.3, 0.1, 0.4, 0.2, 0.3, 0.1, 0.3, 0.3, 0.2, 0.2, 
0.2), group = c("A", "A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "B", "C", "C", "C", "C", "C", "C"), type = c("X", "Y", 
"X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", 
"Y", "X", "Y"), id = c(1L, 1L, 3L, 3L, 5L, 6L, 3L, 3L, 11L, 10L, 
12L, 12L, 12L, 12L, 5L, 5L, 3L, 2L)), .Names = c("day", "value", 
"group", "type", "id"), class = "data.frame", 
row.names = c(NA, -18L))

回答2:

Similar to P Lapointe, I had the following. I initially wanted to use all() to check if all id values are identical or not, but that attempt failed. So, I chose to use diff(). Using mutate() I checked if all id values are identical or not for each group. Then, I chose rows which does not have the combination of check == TRUE and type == "Y". Finally, I removed the check column.

group_by(df1, day, group) %>%
mutate(check = any(diff(id) == 0)) %>%
filter(!(check == TRUE & type == "Y")) %>%
select(-check)

#     day value  group  type    id
#   (int) (dbl) (fctr) (chr) (dbl)
#1      1   0.1      A     X     1
#2      2   0.2      A     X     3
#3      3   0.3      A     X     5
#4      3   0.2      A     Y     6
#5      1   0.1      B     X     3
#6      2   0.1      B     X    11
#7      2   0.4      B     Y    10
#8      3   0.2      B     X    12
#9      1   0.1      C     X    12
#10     2   0.3      C     X     5
#11     3   0.2      C     X     3
#12     3   0.2      C     Y     2

EDIT

After communicating with akrun, I revised the code above. Here it is.

group_by(df1, day, group) %>%
mutate(check = n_distinct(id) == 1) %>%
filter(!(check == TRUE & type == "Y")) %>%
select(-check)

回答3:

Here's my solution with dplyr.

df1%>%
group_by(day,group)%>%
mutate(len=length(unique(id))==1)%>%
filter(ifelse(len==TRUE,type=="X",type %in% c("X","Y")))

Source: local data frame [12 x 6]
Groups: day, group

   day value group type id   len
1    1   0.1     A    X  1  TRUE
2    2   0.2     A    X  3  TRUE
3    3   0.3     A    X  5 FALSE
4    3   0.2     A    Y  6 FALSE
5    1   0.1     B    X  3  TRUE
6    2   0.1     B    X 11 FALSE
7    2   0.4     B    Y 10 FALSE
8    3   0.2     B    X 12  TRUE
9    1   0.1     C    X 12  TRUE
10   2   0.3     C    X  5  TRUE
11   3   0.2     C    X  3 FALSE
12   3   0.2     C    Y  2 FALSE

回答4:

Try this:

 df1 %>% group_by(day,group) %>% distinct(id)

Your original data is ordered, so it is ok, otherwise try

df1 %>% group_by(day, group) %>%
    arrange(group, day, type) %>% distinct(id) %>%
    ungroup %>%‌ arrange(group, day, type, id)

来源：https://stackoverflow.com/questions/33715660/conditional-filter-of-grouped-factors-dplyr

标签

group-by

dplyr