Question
I have a data frame with two columns containing character strings. In one column (named probes) I have duplicated cases (that is, several cases with the same character string). For each case in probes I want to find all the cases containing the same string, and then merge the values of all the corresponding cases in the second column (named genes) into a single case.
For example, if I have this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1
3 cg00061679 DAZ4
4 cg00061679 DAZ4
I want to change it to this structure:
probes genes
1 cg00050873 TSPY4
2 cg00061679 DAZ1 DAZ4 DAZ4
Obviously there is no problem doing this for a single probe using which, and then paste with collapse:
ind<-which(olap$probes=="cg00061679")
genename<-(olap[ind,2])
genecomb<-paste(genename[1:length(genename)], collapse=" ")
But I'm not sure how to extract the indices of the duplicates in the probes column across the whole data frame. Any ideas?
Thanks in advance
Answer 1:
You can use tapply in base R:
data.frame(probes=unique(olap$probes),
genes=tapply(olap$genes, olap$probes, paste, collapse=" "))
or use plyr:
library(plyr)
ddply(olap, "probes", summarize, genes = paste(genes, collapse=" "))
UPDATE
It's probably safer in the first version to do this:
tmp <- tapply(olap$genes, olap$probes, paste, collapse=" ")
data.frame(probes=names(tmp), genes=tmp)
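This guards against unique returning the probes in a different order from tapply: unique keeps the order of first appearance, while tapply orders its result by the sorted levels of the grouping variable. Personally I would always use ddply.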
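To make the ordering concern concrete, here is a minimal sketch in base R. It assumes a small stand-in for olap whose rows are deliberately reordered relative to the question's example, so that the first-appearance order differs from the sorted order; the names tmp and result are only illustrative.

```r
# Hypothetical stand-in for olap, with rows reordered so that the
# first probe to appear is not the first one in sorted order.
olap <- data.frame(
  probes = c("cg00061679", "cg00050873", "cg00061679", "cg00061679"),
  genes  = c("DAZ1", "TSPY4", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# tapply groups genes by probe and pastes each group into one string.
tmp <- tapply(olap$genes, olap$probes, paste, collapse = " ")

# unique() follows first appearance; names(tmp) follows sorted group order.
order_by_unique <- unique(olap$probes)
order_by_tapply <- names(tmp)

# Taking the probe labels from names(tmp) keeps labels and values aligned.
result <- data.frame(
  probes = names(tmp),
  genes  = as.vector(tmp),
  stringsAsFactors = FALSE
)
print(result)
```

With this input the two orderings disagree, so pairing unique(olap$probes) with the tapply output would attach the wrong genes to each probe, while the names(tmp) version stays correct.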
Answer 2:
Base R aggregate() should work fine for this:
aggregate(genes ~ probes, data = olap, as.vector)
# probes genes
# 1 cg00050873 TSPY4
# 2 cg00061679 DAZ1, DAZ4, DAZ4
I prefer as.vector in case I need to do any further work on the data (this stores the genes column as a list), but you can also try aggregate(genes ~ probes, data = olap, paste, collapse = " ") if you prefer it to be a character string.
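The difference between the two aggregate() variants can be seen in a short sketch. It assumes olap is rebuilt from the question's example; agg_list and agg_str are illustrative names. Note that the list column arises here because the groups have different sizes; if every probe happened to have the same number of genes, aggregate() would simplify the as.vector result instead.

```r
# Rebuild the example data from the question.
olap <- data.frame(
  probes = c("cg00050873", "cg00061679", "cg00061679", "cg00061679"),
  genes  = c("TSPY4", "DAZ1", "DAZ4", "DAZ4"),
  stringsAsFactors = FALSE
)

# Variant 1: keep each group's genes as a vector (genes becomes a list column
# because the groups differ in length).
agg_list <- aggregate(genes ~ probes, data = olap, as.vector)

# Variant 2: collapse each group's genes into a single space-separated string.
agg_str <- aggregate(genes ~ probes, data = olap, paste, collapse = " ")

str(agg_list$genes)  # list with one element per probe
str(agg_str$genes)   # plain character vector
```

The list form is convenient for further per-probe processing (for example with lapply), while the string form is easier to print or write to a file.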
Source: https://stackoverflow.com/questions/12054816/r-finding-duplicates-in-one-column-and-collapsing-in-a-second-column