Select rows with common ids in grouped data frame

问题

I am searching for a simpler solution to the following problem. Here is my setup:

test <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_3",     5,   "g",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_3",     8,   "k",
     "groupC",   "id_4",     9,   "l",
     "groupC",   "id_5",     0,   "o",
     "groupC",   "id_6",    12,   "u"
  )

I want to select only those elements in id_name that are common to all groups in group_name - i.e. drop the rows for ids that are not present in all the groups. My actual data is large (200k rows) with anywhere between 4-20 groups (I don't know the number of groups beforehand so the solution must work for any number of groups). The id_name in each group is NOT unique. The desired result would be:

test_result <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_4",     9,   "l",
  )

(the rows with ids absent in at least one group are dropped). Ideally I do not want my output to have the columns joined at the end. I want "simply" to drop the rows missing in any one group but maintain the shape of the dataframe.

And I know that I can extract all the ids from each group, then intersect them all to obtain the list of unique ids present in all groups and then filter the main dataframe for just these IDs. But that sounds like a lot of work ;-)

Any hints would be very much appreciated.

回答1:

In base R, we can split id_name by group_name find common id's and then subset

subset(test, id_name %in% Reduce(intersect, split(id_name, group_name)))

#   group_name id_name  varA varB 
#   <chr>      <chr>   <dbl> <chr>
# 1 groupA     id_1        1 a    
# 2 groupA     id_2        4 f    
# 3 groupA     id_4        6 x    
# 4 groupA     id_4        6 h    
# 5 groupB     id_1        2 s    
# 6 groupB     id_2       13 y    
# 7 groupB     id_4       14 t    
# 8 groupC     id_1        3 d    
# 9 groupC     id_2        7 j    
#10 groupC     id_4        9 l

Using similar concept in tidyverse, it would be

library(tidyverse)
test %>%
  filter(id_name %in% (test %>%
                         group_split(group_name)  %>%
                         map(~pull(., id_name)) %>%
                         reduce(intersect)))

回答2:

You can tabulate the the occurrence of id_name by group_name:

table(test$group_name,test$id_name)

If id_name is present in every group, we want columns that have all > 0 entris. We can simplify this logic using a combination of >0 and colMeans:

keep = names(which(colMeans(table(test$group_name,test$id_name)>0)==1))

Using this:

test[test$id_name %in% keep,]

来源：https://stackoverflow.com/questions/58901790/select-rows-with-common-ids-in-grouped-data-frame

标签

dplyr

tidyverse