Select rows with common ids in grouped data frame

大兔子大兔子 提交于 2020-05-08 18:53:11

问题


I am searching for a simpler solution to the following problem. Here is my setup:

test <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_3",     5,   "g",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_3",     8,   "k",
     "groupC",   "id_4",     9,   "l",
     "groupC",   "id_5",     0,   "o",
     "groupC",   "id_6",    12,   "u"
  )

I want to select only those elements in id_name that are common to all groups in group_name - i.e. drop the rows for ids that are not present in all the groups. My actual data is large (200k rows) with anywhere between 4-20 groups (I don't know the number of groups beforehand so the solution must work for any number of groups). The id_name in each group is NOT unique. The desired result would be:

test_result <- tibble::tribble(
  ~group_name, ~id_name, ~varA, ~varB,
     "groupA",   "id_1",     1,   "a",
     "groupA",   "id_2",     4,   "f",
     "groupA",   "id_4",     6,   "x",
     "groupA",   "id_4",     6,   "h",
     "groupB",   "id_1",     2,   "s",
     "groupB",   "id_2",    13,   "y",
     "groupB",   "id_4",    14,   "t",
     "groupC",   "id_1",     3,   "d",
     "groupC",   "id_2",     7,   "j",
     "groupC",   "id_4",     9,   "l",
  )

(the rows with ids absent in at least one group are dropped). Ideally I do not want my output to have the columns joined at the end. I want "simply" to drop the rows missing in any one group but maintain the shape of the dataframe.

And I know that I can extract all the ids from each group, then intersect them all to obtain the list of unique ids present in all groups and then filter the main dataframe for just these IDs. But that sounds like a lot of work ;-)

Any hints would be very much appreciated.


回答1:


In base R, we can split id_name by group_name find common id's and then subset

subset(test, id_name %in% Reduce(intersect, split(id_name, group_name)))

#   group_name id_name  varA varB 
#   <chr>      <chr>   <dbl> <chr>
# 1 groupA     id_1        1 a    
# 2 groupA     id_2        4 f    
# 3 groupA     id_4        6 x    
# 4 groupA     id_4        6 h    
# 5 groupB     id_1        2 s    
# 6 groupB     id_2       13 y    
# 7 groupB     id_4       14 t    
# 8 groupC     id_1        3 d    
# 9 groupC     id_2        7 j    
#10 groupC     id_4        9 l    

Using similar concept in tidyverse, it would be

library(tidyverse)
test %>%
  filter(id_name %in% (test %>%
                         group_split(group_name)  %>%
                         map(~pull(., id_name)) %>%
                         reduce(intersect)))



回答2:


You can tabulate the the occurrence of id_name by group_name:

table(test$group_name,test$id_name)

If id_name is present in every group, we want columns that have all > 0 entris. We can simplify this logic using a combination of >0 and colMeans:

keep = names(which(colMeans(table(test$group_name,test$id_name)>0)==1))

Using this:

test[test$id_name %in% keep,]


来源:https://stackoverflow.com/questions/58901790/select-rows-with-common-ids-in-grouped-data-frame

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!