Longest common substring in R: finding non-contiguous matches between two strings

挽巷 2020-11-29 07:21

I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However,

4 answers
  •  青春惊慌失措
    2020-11-29 08:16

    Here are three possible solutions.

    library(stringi)
    library(stringdist)
    
    a <- "hello"
    b <- "hel123l5678o"
    
    ## get all prefixes of 'b' (substrings that start at its first character)
    sb <- stri_sub(b, 1, 1:nchar(b))
    ## extract them from 'a' if they exist
    sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
    ## match the longest one
    sstr[which.max(nchar(sstr))]
    # [1] "hel"
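
    The snippet above only tries substrings of b that start at its first character. As a rough sketch of the more general contiguous case, the hypothetical helper below (longest_common_substring() is not part of any package) tries every substring of b using base R only. It is a brute-force illustration, not an efficient implementation.

```r
# Hypothetical helper: brute-force longest common (contiguous) substring.
# Tries every substring of 'b' and keeps the longest one found in 'a'.
longest_common_substring <- function(a, b) {
  n <- nchar(b)
  best <- ""
  for (start in seq_len(n)) {
    for (end in start:n) {
      cand <- substr(b, start, end)
      # fixed = TRUE treats 'cand' as a literal string, not a regex
      if (nchar(cand) > nchar(best) && grepl(cand, a, fixed = TRUE)) {
        best <- cand
      }
    }
  }
  best
}

longest_common_substring("hello", "hel123l5678o")
# [1] "hel"
```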
    

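    There are also adist() and agrep() in base R, and the stringdist package has a few functions that use the LCS (longest common subsequence) method. Here's a look at stringdist. With method="lcs" it returns the number of unpaired characters, i.e. the characters in either string that are not part of the longest common subsequence.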

    stringdist(a, b, method="lcs")
    # [1] 7
    
    Filter("!", mapply(
        stringdist, 
        stri_sub(b, 1, 1:nchar(b)),
        stri_sub(a, 1, 1:nchar(b)),
        MoreArgs = list(method = "lcs")
    ))
    #  h  he hel 
    #  0   0   0 
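
    The LCS distance only counts the unpaired characters: 7 = 5 + 12 - 2 * 5 for these two strings. To see the matched characters themselves, one option is the textbook dynamic-programming approach. The function below, lcs_subsequence(), is a hypothetical base R sketch and not how stringdist computes its result internally.

```r
# Hypothetical helper: longest common subsequence (non-contiguous match)
# using the standard dynamic-programming table, base R only.
lcs_subsequence <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # len[i + 1, j + 1] holds the LCS length of the first i characters of 'a'
  # and the first j characters of 'b'
  len <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      len[i + 1, j + 1] <- if (x[i] == y[j]) {
        len[i, j] + 1L
      } else {
        max(len[i, j + 1], len[i + 1, j])
      }
    }
  }
  # Walk back from the bottom-right corner to recover the characters
  out <- character(0)
  i <- n
  j <- m
  while (i > 0 && j > 0) {
    if (x[i] == y[j]) {
      out <- c(x[i], out)
      i <- i - 1
      j <- j - 1
    } else if (len[i, j + 1] >= len[i + 1, j]) {
      i <- i - 1
    } else {
      j <- j - 1
    }
  }
  paste(out, collapse = "")
}

lcs_subsequence("hello", "hel123l5678o")
# [1] "hello"
```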
    

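    Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE, the result carries a "trafos" attribute: a string of M (match), I (insertion), D (deletion) and S (substitution) codes describing how a is transformed into b. If we locate the runs of M in that string with stri_locate_all_regex(), we get a matrix of start and end positions that we can use to pull the matches out of b.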

    ta <- drop(attr(adist(a, b, counts=TRUE), "trafos"))
    # [1] "MMMIIIMIIIIM"
    

    The M values denote characters that match straight across. We can then get those substrings with stri_sub():

    stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
    # [1] "hel" "l"   "o" 
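
    As a minimal sketch of the same idea in base R only, the matched runs can be pasted back together to give the non-contiguous match. Two assumptions are worth flagging: the names tb, runs and pieces are introduced here for illustration, and any D steps are dropped first, because a deletion consumes a character of a but not of b, so it would otherwise shift the positions. Note also that adist() minimises edit distance, so when substitutions are involved the result is not guaranteed to be the longest common subsequence.

```r
a <- "hello"
b <- "hel123l5678o"

# Transformation string describing how 'a' is turned into 'b'
ta <- drop(attr(adist(a, b, counts = TRUE), "trafos"))

# Assumption: drop 'D' steps so the remaining codes line up with 'b'
tb <- gsub("D", "", ta, fixed = TRUE)

# Start and end positions of each run of matches
runs <- gregexpr("M+", tb)[[1]]
starts <- as.integer(runs)
ends <- starts + attr(runs, "match.length") - 1L

# Extract the matched pieces from 'b' and join them
pieces <- substring(b, starts, ends)
paste(pieces, collapse = "")
# [1] "hello"
```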
    

    Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.
