Question
I am trying to find a way in R to take the difference of two string vectors, comparing only the first 3 tab-delimited columns of each string. For example, here are list1 and list2:
list1:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
"1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
list2:
"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
I want to do setdiff(list2, list1), so that I get everything in list2 that does not exist in list1, but I want the comparison to use only the first 3 tab-delimited fields. So for the first entry in list1 I would only consider:
"1\t1113200\t1118399"
However, I still want the full string returned; only the comparison should be restricted to the first 3 columns. I am having trouble figuring out how to do this, so any help would be appreciated. I've already looked at several SO posts, and none of them seemed to help.
Answer 1:
For extracting the first three columns (not sure why you need this as a long string rather than a data frame...), I would use beg2char() from the qdap library. (Although, if the first three columns are always the same total length, base substr() will work fine.)
beg2char(list1, '\t', 3) # Will extract from the beginning up to the third tab delimiter
Then, rather than setdiff, I would simply use %in% to check whether the substring of each element in list2 matches the substring of any element in list1.
beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3) # will give you TRUE/FALSE
list2[!(beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3))]
This will give the full elements of list2 whose substrings do not exist in list1.
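If you would rather not depend on qdap, the same idea can be sketched in base R. This is only a rough sketch, not the method from the answer above: first_fields() is a hypothetical helper name, and the extra row is made up purely to show a non-empty result, since with the vectors from the question every list2 entry already has a matching key in list1.

```r
list1 <- c(
  "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n",
  "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n",
  "1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
)
list2 <- c(
  "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n",
  "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
)

# Hypothetical helper: build a comparison key from the first n tab-delimited fields.
first_fields <- function(x, n = 3) {
  vapply(
    strsplit(x, "\t", fixed = TRUE),
    function(parts) paste(head(parts, n), collapse = "\t"),
    character(1)
  )
}

key1 <- first_fields(list1)
key2 <- first_fields(list2)

# Keep the full strings of list2 whose key does not appear among the list1 keys.
result <- list2[!(key2 %in% key1)]

# Made-up row, for illustration only (not from the question's data).
extra   <- "2\t100\t200\tmade-up row for illustration\n"
list2b  <- c(list2, extra)
result_b <- list2b[!(first_fields(list2b) %in% key1)]
```

Here result is empty, and result_b contains only the made-up row. Unlike setdiff(), this keeps duplicates in list2; wrap the result in unique() if you want set semantics.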
Source: https://stackoverflow.com/questions/39679578/r-how-to-use-setdiff-on-two-string-vectors-by-only-comparing-the-first-3-tab-de