R finding duplicates in one column and collapsing in a second column

試著忘記壹切 posted on 2019-11-28 01:58:05

Question


I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several rows with the same character string). For each case in probes I want to find all the rows containing the same string, and then merge the values of the corresponding rows in the second column (named genes) into a single row. For example, if I have this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1
3   cg00061679  DAZ4
4   cg00061679  DAZ4

I want to change it to this structure:

    probes  genes
1   cg00050873  TSPY4
2   cg00061679  DAZ1 DAZ4 DAZ4

Obviously there is no problem doing this for a single probe using which, followed by paste with collapse:

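# rows whose probe matches the one of interest
ind <- which(olap$probes == "cg00061679")
# the corresponding entries of the second column (genes)
genename <- olap[ind, 2]
# join them into one space-separated string
genecomb <- paste(genename, collapse = " ")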

But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?

Thanks in advance


Answer 1:


You can use tapply in base R:

data.frame(probes=unique(olap$probes), 
           genes=tapply(olap$genes, olap$probes, paste, collapse=" "))

or use plyr:

library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))

UPDATE

It's probably safer in the first version to do this:

tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)

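Just in case unique gives the probes in a different order to tapply: unique returns values in order of first appearance, whereas tapply orders its result by the sorted factor levels of the grouping variable. Personally I would always use ddply.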
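To make the ordering concern concrete, here is a minimal base R sketch. The data frame is hypothetical: it holds the same rows as the question, reordered on purpose so that the first-appearance order of probes differs from their sorted order. The names wrong and right are just illustrative.

```r
# Hypothetical data: the question's rows, reordered so that the order of
# first appearance differs from the sorted order of the probe IDs.
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

unique(olap$probes)  # "cg00061679" "cg00050873" (order of first appearance)

tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")
names(tmp)           # "cg00050873" "cg00061679" (sorted group labels)

# Pairing by position attaches each gene string to the wrong probe here.
wrong <- data.frame(probes = unique(olap$probes), genes = as.vector(tmp))

# Taking the labels from the tapply result keeps each pair together.
right <- data.frame(probes = names(tmp), genes = as.vector(tmp))
```

With the data in the question's original order both versions happen to agree, which is why the mismatch is easy to miss.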




Answer 2:


Base R aggregate() should work fine for this:

aggregate(genes ~ probes, data = olap, as.vector)
#       probes            genes
# 1 cg00050873            TSPY4
# 2 cg00061679 DAZ1, DAZ4, DAZ4

I prefer as.vector in case I need to do any further work on the data (this stores the genes column as a list), but you can also try aggregate(genes ~ probes, data = olap, paste, collapse = " ") if you prefer it to be a character string.
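As a rough sketch of the difference between the two calls, the following rebuilds the question's example data and runs both. The stringsAsFactors = FALSE setting and the names listed and collapsed are assumptions made for illustration.

```r
# The example data from the question (stringsAsFactors = FALSE is an
# assumption, so that genes is a plain character column).
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# as.vector keeps the genes of each probe as separate elements.
listed <- aggregate(genes ~ probes, data = olap, as.vector)
is.list(listed$genes)

# paste with collapse = " " gives one string per probe instead;
# extra arguments after FUN are passed on to paste.
collapsed <- aggregate(genes ~ probes, data = olap, FUN = paste, collapse = " ")
is.character(collapsed$genes)
collapsed$genes  # "TSPY4" "DAZ1 DAZ4 DAZ4"
```

The list form is handier if the genes need further per-element processing; the collapsed form matches the output requested in the question.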



Source: https://stackoverflow.com/questions/12054816/r-finding-duplicates-in-one-column-and-collapsing-in-a-second-column
