I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated entries (that is, several rows with the same character string). For each value in probes I want to find all the rows containing the same string, and then merge the values of the corresponding rows in the second column (named genes) into a single row.
For example, if I have this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1
3 cg00061679 DAZ4
4 cg00061679 DAZ4
I want to change it to this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1 DAZ4 DAZ4
Obviously there is no problem doing this for a single probe using which, and then paste with collapse:
ind <- which(olap$probes == "cg00061679")
genename <- olap[ind, 2]
genecomb <- paste(genename, collapse = " ")
But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?
Thanks in advance
You can use tapply in base R:
data.frame(probes=unique(olap$probes),
genes=tapply(olap$genes, olap$probes, paste, collapse=" "))
or use plyr:
library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))
UPDATE
It's probably safer in the first version to do this:
tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)
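Just in case unique gives the probes in a different order to tapply: unique returns values in order of first appearance, while tapply orders its result by the sorted factor levels of probes, so the two only line up if the data already happen to be sorted. Personally I would always use ddply.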
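To see why the names(tmp) version is safer, here is a minimal sketch. The data frame below is made up for illustration (the rows from the question, deliberately reordered so the probes are not sorted); res is just a hypothetical name for the result.

```r
# Hypothetical data: same rows as in the question, but not sorted by probe
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# unique() keeps the order of first appearance
unique(olap$probes)

# tapply() orders its result by the sorted group labels
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")
names(tmp)

# Taking the labels from tapply itself keeps each probe next to its own genes
res <- data.frame(probes = names(tmp), genes = as.vector(tmp),
                  stringsAsFactors = FALSE)
res
```

With this input, pairing unique(olap$probes) with tmp would put cg00061679 next to TSPY4, which is exactly the mismatch the update guards against.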
Base R aggregate() should work fine for this:
aggregate(genes ~ probes, data = olap, as.vector)
# probes genes
# 1 cg00050873 TSPY4
# 2 cg00061679 DAZ1, DAZ4, DAZ4
I prefer as.vector in case I need to do any further work on the data (this stores the genes column as a list), but you can also try aggregate(genes ~ probes, data = olap, paste, collapse = " ") if you prefer it to be a character string.
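A quick way to compare the two aggregate() variants, sketched on the example data from the question (as_list and as_text are hypothetical names). Note that the list column appears here because the groups have different sizes; if every probe had the same number of rows, aggregate() would simplify the result instead.

```r
# Example data from the question
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# Variant 1: keep each group's genes as a separate vector
as_list <- aggregate(genes ~ probes, data = olap, as.vector)

# Variant 2: collapse each group's genes into one space-separated string
as_text <- aggregate(genes ~ probes, data = olap, paste, collapse = " ")

is.list(as_list$genes)   # the genes column holds one vector per probe
as_list$genes[[2]]       # individual genes for the second probe
as_text$genes            # one string per probe
```

The list form is handy if you later want to count or deduplicate genes per probe (for example with lengths() or lapply(..., unique)); the string form is easier to print or write to a file.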
Source: https://stackoverflow.com/questions/12054816/r-finding-duplicates-in-one-column-and-collapsing-in-a-second-column