Question
I'm trying to automate a process that I normally do in Excel. The process consists of merging and comparing different columns. For example:
df1:
sp|P07437|TBB5_HUMAN
sp|P10809|CH60_HUMAN
sp|P424|LPPRC_HUMAN
sp|P474|LRC_HUMAN
df2:
sp|P07437|TBB5_HUMAN
sp|P10809|CH60_HUMAN
sp|P42704|LPPRC_HUMAN
df3:
sp|P07437|TBB5_HUMAN
sp|P10788|CH70_HUMAN
sp|P42704|LPPRC_HUMAN
And the output is something like this:
sp|P07437|TBB5_HUMAN | sp|P07437|TBB5_HUMAN  | sp|P07437|TBB5_HUMAN
sp|P10809|CH60_HUMAN | sp|P10809|CH60_HUMAN  |
                     |                       | sp|P10788|CH70_HUMAN
sp|P424|LPPRC_HUMAN  |                       |
sp|P474|LRC_HUMAN    |                       |
                     | sp|P42704|LPPRC_HUMAN | sp|P42704|LPPRC_HUMAN
I tried the compare and merge functions, but neither gave me this result. Is there another function I can use in this case?
It is more or less like a Venn diagram, which is exactly what I build after this step to check that everything is correct.
Here is a reproducible example:
df1 = data.frame(TEST1=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"))
df2 = data.frame(TEST2=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN"," sp|P42704|LPPRC_HUMAN"))
df3 = data.frame(TEST3=c("sp|P07437|TBB5_HUMAN","sp|P10788|CH70_HUMAN", "sp|P42704|LPPRC_HUMAN"))
Thank you very much.
Answer 1:
I'm using a slightly modified version of your data that avoids factors. I also trim the extra whitespace (with trimws), assuming it's a copy/paste mistake.
df1 = data.frame(TEST1=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"),
                 stringsAsFactors = FALSE)
df2 = data.frame(TEST2=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN"," sp|P42704|LPPRC_HUMAN"),
                 stringsAsFactors = FALSE)
df3 = data.frame(TEST3=c("sp|P07437|TBB5_HUMAN","sp|P10788|CH70_HUMAN", "sp|P42704|LPPRC_HUMAN"),
                 stringsAsFactors = FALSE)
Since this kind of problem can easily extend to include more than the initial count of data.frames, I usually prefer to work with lists of data.frames, not explicit data.frames, if at all possible.
lst <- list(df1, df2, df3)
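If the columns come from separate files, the list can be built directly while reading them. The following is a minimal sketch under assumptions: the files are one-column CSVs with a header, and the directory, file names, and the variable lst_from_files are hypothetical and created here only so the example runs on its own.

```r
# Hypothetical setup: write two small one-column CSV files to a fresh temp directory
tmp <- tempfile("merge_compare_demo")
dir.create(tmp)
writeLines(c("TEST1", "sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN"),
           file.path(tmp, "test1.csv"))
writeLines(c("TEST2", "sp|P07437|TBB5_HUMAN", "sp|P42704|LPPRC_HUMAN"),
           file.path(tmp, "test2.csv"))

# Read every CSV in the directory into one list of data.frames
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
lst_from_files <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Each element is a single-column data.frame named after the file's header
sapply(lst_from_files, names)
```

Adding another file to the directory then requires no change to the code that follows.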
Now here's one method to get your desired results:
alltests <- unique(trimws(unlist(lst, recursive = TRUE)))
as.data.frame(
  setNames(
    # for each data.frame, keep a value where it occurs and NA where it does not
    lapply(lst, function(a) { x <- trimws(a[, 1]); x[ match(alltests, x) ] }),
    sapply(lst, names)
  ),
  stringsAsFactors = FALSE
)
#                  TEST1                 TEST2                 TEST3
# 1 sp|P07437|TBB5_HUMAN  sp|P07437|TBB5_HUMAN  sp|P07437|TBB5_HUMAN
# 2 sp|P10809|CH60_HUMAN  sp|P10809|CH60_HUMAN                  <NA>
# 3  sp|P424|LPPRC_HUMAN                  <NA>                  <NA>
# 4                 <NA> sp|P42704|LPPRC_HUMAN sp|P42704|LPPRC_HUMAN
# 5                 <NA>                  <NA>  sp|P10788|CH70_HUMAN
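Since the question mentions a Venn-diagram check, a presence/absence table can be derived from the same inputs. The following is a minimal sketch rather than part of the method itself; it restates the data so it runs on its own, and the names sets and membership are illustrative.

```r
# Data restated from the question (the stray leading space is trimmed below)
df1 <- data.frame(TEST1 = c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(TEST2 = c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", " sp|P42704|LPPRC_HUMAN"),
                  stringsAsFactors = FALSE)
df3 <- data.frame(TEST3 = c("sp|P07437|TBB5_HUMAN", "sp|P10788|CH70_HUMAN", "sp|P42704|LPPRC_HUMAN"),
                  stringsAsFactors = FALSE)
lst <- list(df1, df2, df3)

# One trimmed character vector per data.frame, named after its single column
sets <- setNames(lapply(lst, function(a) trimws(a[[1]])), sapply(lst, names))
alltests <- unique(unlist(sets, use.names = FALSE))

# Logical matrix: rows are identifiers, columns are the data.frames
membership <- sapply(sets, function(s) alltests %in% s)
rownames(membership) <- alltests

colSums(membership)      # number of identifiers in each set
Reduce(intersect, sets)  # identifiers shared by all sets
# [1] "sp|P07437|TBB5_HUMAN"
```

The row sums of membership give the number of sets each identifier appears in, which corresponds to the regions of a Venn diagram.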
The lapply/setNames approach relies on (1) single-column data.frames (though that can be remedied) and (2) unique column names. Your suggested output did not imply any sorting, so I opted not to sort here; it's easy enough to use alltests <- sort(unique(...)), though note that this is an alphabetic sort, not one based on the numeric portion of the substrings.
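If ordering by the numeric portion is wanted, one possible sketch is shown below. It assumes the second "|"-separated field is the accession and that stripping its non-digit characters gives a meaningful number, which holds for the identifiers in this question but not necessarily for accessions with letters in the middle. The names accession and acc_num are illustrative.

```r
alltests <- c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN",
              "sp|P42704|LPPRC_HUMAN", "sp|P10788|CH70_HUMAN")

# Take the second "|"-separated field of every identifier
accession <- vapply(strsplit(alltests, "|", fixed = TRUE), `[`, character(1), 2)

# Assumption: the digits of the accession define the desired order
acc_num <- as.numeric(gsub("[^0-9]", "", accession))

alltests_sorted <- alltests[order(acc_num)]
alltests_sorted
# [1] "sp|P424|LPPRC_HUMAN"   "sp|P07437|TBB5_HUMAN"  "sp|P10788|CH70_HUMAN"
# [4] "sp|P10809|CH60_HUMAN"  "sp|P42704|LPPRC_HUMAN"
```

Using alltests_sorted in place of alltests changes only the row order of the resulting table.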
Source: https://stackoverflow.com/questions/43266579/merge-and-compare-different-columns-from-different-files