R select all rows from a dataframe where a value is duplicated one column but has a specific value in another column

问题

I am trying to extract from my R dataframe, rows that have duplicate values in one column but which in another column have either a 0 or a 1.

For example, if this is the dataframe:

Data <- data.frame(
+     X = c(1,3,5,7,7,8,9,10,10,11,11),
+     Y = sample(36476545:36476555),
+ timepoint = c(0,1,0,0,1,1,0,1,0,1,1)
+ )

which looks like

> Data
    X        Y timepoint
1   1 36476549         0
2   3 36476545         1
3   5 36476552         0
4   7 36476547         0
5   7 36476546         1
6   8 36476548         1
7   9 36476551         0
8  10 36476555         1
9  10 36476553         0
10 11 36476554         1
11 11 36476550         1

My desired output will be all rows for which values are duplicated in X with timepoint = 0 for one occurrence of the value and 1 for the other occurrence, resulting in

> Data
    X        Y timepoint
4   7 36476547         0
5   7 36476546         1
8  10 36476555         1
9  10 36476553         0

Note that the last two items of data, which are also duplicated in X are not counted because the timepoint variable is 1 in both cases. There is a solution in SQL which comes close but I do not know how to code that in R.

Solution I tried by first creating a dataframe of the duplicates and then attempting to get the ones I want from there:

dupes <- Data[Data$X %in% 
Data$X[duplicated(Data$X)],]
ids <- Data$X[Data$timepoint==0]
Data[Data$X %in% ids,]

But this returned rows which do not have a duplicate entry. Any help would be appreciated, thank you!

回答1:

Is it what your were looking for?

library(dplyr)

Data <- data.frame(
     X = c(1,3,5,7,7,8,9,10,10,11,11),
     Y = sample(36476545:36476555),
 timepoint = c(0,1,0,0,1,1,0,1,0,1,1)
 )

Data %>% 
  group_by(X) %>% 
  filter(dplyr::n_distinct(timepoint) == 2)

n_distinct returns the number of distinct elements in a vector. Due to the group by statement, only rows with two distinct timepoints per group (X) are returned.

dplyr::n_distinct(c(1, 2, 1, 3, 1))

Data %>% 
  group_by(X) %>% 
  mutate(n_distinct = dplyr::n_distinct(timepoint))

来源：https://stackoverflow.com/questions/54583558/r-select-all-rows-from-a-dataframe-where-a-value-is-duplicated-one-column-but-ha

标签

dataframe

dplyr

duplicates

subset