R - find all unique values among subsets of a data frame

Submitted by 落花浮王杯 on 2019-12-02 05:55:14

Question


I have a data frame with two columns. The first column defines subsets of the data. I want to find all values in the second column that only appear in one subset in the first column.

For example, from:

df=data.frame(
  data_subsets=rep(LETTERS[1:2],each=5),
  data_values=c(1,2,3,4,5,2,3,4,6,7))

data_subsets data_values
      A           1
      A           2
      A           3
      A           4
      A           5
      B           2
      B           3
      B           4
      B           6
      B           7

I would want to extract the following data frame.

data_subsets   data_values
    A              1
    A              5
    B              6
    B              7

I have been playing around with duplicated but I just can't seem to make it work. Any help is appreciated. There are a number of topics tackling similar problems, I hope I didn't overlook the answer in my searches!

EDIT

I modified the approach from @Matthew Lundberg of counting the number of elements and extracting from the data frame. For some reason his approach was not working with the data frame I had, so I came up with this, which is less elegant but gets the job done:

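# factor() keeps both subset levels in every per-value table, so rbind()
# lines the columns up even if data_subsets is a character column
# (the data.frame() default since R 4.0.0)
counts=rowSums(do.call("rbind",tapply(factor(df$data_subsets),df$data_values,FUN=table)))
extract=names(counts)[counts==1]
df[match(extract,df$data_values),]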

Answer 1:


First, find the count of each element in df$data_values:

 x <- sapply(df$data_values, function(x) sum(as.numeric(df$data_values == x)))

> x
 [1] 1 2 2 2 1 2 2 2 1 1

Now extract the rows:

> df[x==1,]
   data_subsets data_values
1             A           1
5             A           5
9             B           6
10            B           7

Note that you missed "A 5" above. There is no "B 5".
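
As a possible shortcut, the same counts can be obtained with a single table() call instead of comparing every element against the whole column. This is only a sketch on the question's sample data; the names counts and n_per_row are made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  data_subsets = rep(LETTERS[1:2], each = 5),
  data_values  = c(1, 2, 3, 4, 5, 2, 3, 4, 6, 7))

# Tabulate every value once; the table is named by the value
counts <- table(df$data_values)

# Look up each row's count by name (hypothetical helper vector)
n_per_row <- as.vector(counts[as.character(df$data_values)])

# Keep the rows whose value occurs exactly once in the whole column
result <- df[n_per_row == 1, ]
print(result)
```

Like the sapply() version, this counts occurrences across the whole column, so a value repeated inside a single subset would also be dropped.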




Answer 2:


You had the right idea with duplicated. The trick is to combine fromLast = TRUE and fromLast = FALSE options to get a full list of non-duplicated rows.

!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE)
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

Indexing your data.frame with this vector gives:

df[!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE),]
   data_subsets data_values
1             A           1
5             A           5
9             B           6
10            B           7
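
To see why both directions are needed, here is a small illustration on a made-up vector v (not from the question). Each duplicated() call on its own leaves one copy of a repeated value unflagged; only the combination flags every copy.

```r
v <- c(1, 2, 2, 3)   # hypothetical vector: only the value 2 is repeated

fwd <- duplicated(v)                   # flags the second 2 only
bwd <- duplicated(v, fromLast = TRUE)  # flags the first 2 only

# A value is unique only if neither scan flags it
keep <- !fwd & !bwd
print(v[keep])
```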



Answer 3:


A variant of P Lapointe's answer would be

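u <- unique(df)
u[! u$data_values %in% u$data_values[duplicated(u$data_values)], ]

The unique() call handles the possibility (not in your test data) that some rows are identical: such a row is kept once, as long as the same data_values entry does not also appear under a different data_subsets value (or with different values in other columns). Both the duplicated() test and the final indexing use the de-duplicated frame u, so the logical vector always has one entry per row being indexed.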
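
Counting raw occurrences, as the first two answers do, drops a value that is repeated within a single subset even though it appears in only one subset. One way to count distinct subsets per value instead is ave(). This is a sketch under assumptions: the extra row "A 1" and the names df2 and n_subsets are invented for illustration.

```r
# Question's data plus one invented duplicate row ("A", 1)
df2 <- data.frame(
  data_subsets = c(rep(LETTERS[1:2], each = 5), "A"),
  data_values  = c(1, 2, 3, 4, 5, 2, 3, 4, 6, 7, 1))

# For each row, the number of distinct subsets its value occurs in
n_subsets <- ave(as.integer(factor(df2$data_subsets)), df2$data_values,
                 FUN = function(s) length(unique(s)))

# Keep values confined to one subset, then drop repeated rows
result <- unique(df2[n_subsets == 1, ])
print(result)
```

Here the value 1 occurs twice, but both times in subset A, so it is kept (once) alongside 5, 6 and 7.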



Source: https://stackoverflow.com/questions/10873203/r-find-all-unique-values-among-subsets-of-a-data-frame
