I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However,
Leveraging @RichardScriven's insight that adist could be used, here is a function that combines it all.
EDIT: This was tricky because we needed to pick the longest string in two contexts, so I made this helper function:
longest_string <- function(s){return(s[which.max(nchar(s))])}
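As a quick sanity check of the helper (a minimal sketch; the input vectors are made up for illustration), note that which.max returns the first maximum, so ties go to the earliest element:

```r
longest_string <- function(s) s[which.max(nchar(s))]

# Hypothetical inputs, only to show the behavior
res_longest <- longest_string(c("ab", "abcd", "abc"))  # "abcd"
res_tie     <- longest_string(c("abc", "xyz"))         # "abc" (first of the tied elements)
```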
This combines @RichardScriven's work with the stringi library:
library(stringi)
library(stringdist)  # not strictly needed here: adist() comes from the utils package
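lcsbstr <- function(a, b) {
  # Runs of "M" in the adist() transformation string mark consecutive matches
  trafos <- drop(attr(adist(a, b, counts = TRUE), "trafos"))
  sbstr_locations <- stri_locate_all_regex(trafos, "M+")[[1]]
  # Extract every matched run from the longer input, then keep the longest run
  cmn_sbstr <- stri_sub(longest_string(c(a, b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
  return(longest_cmn_sbstr)
}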
We can rewrite it to avoid any external libraries (while still using adist):
lcsbstr_no_lib <- function(a, b) {
  # Locate every run of "M" (consecutive matches) in the transformation string
  matches <- gregexpr("M+", drop(attr(adist(a, b, counts = TRUE), "trafos")))[[1]]
  lengths <- attr(matches, "match.length")
  # Pick the longest run and cut it out of the longer input string
  which_longest <- which.max(lengths)
  index_longest <- matches[which_longest]
  length_longest <- lengths[which_longest]
  longest_cmn_sbstr <- substring(longest_string(c(a, b)), index_longest, index_longest + length_longest - 1)
  return(longest_cmn_sbstr)
}
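Both functions identify only 'hello ' as the longest common substring, instead of 'hello r' (which is the longest common subsequence):
# identical() compares only two objects, so compare all results with == instead
all('hello ' == c(
  lcsbstr_no_lib('hello world', 'hello there'),
  lcsbstr(       'hello world', 'hello there')))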
EDIT: Since the edit, this works regardless of which argument is the longer of the two:
# identical() compares only two objects, so compare all results with == instead
all('hello' == c(
  lcsbstr_no_lib('hello', 'hello there'),
  lcsbstr(       'hello', 'hello there'),
  lcsbstr_no_lib('hello there', 'hello'),
  lcsbstr(       'hello there', 'hello')))
LAST EDIT: But this is only good if you accept the following behavior. Notice this result:
lcsbstr('hello world', 'hello')
#[1] 'hell'
I was expecting 'hello'. However, the transformation that adist reports turns 'hello world' into 'hello' by keeping 'hell', deleting 'o', ' ' and 'w', and then matching the 'o' of 'world'. As a result, only the 'hell' part forms a consecutive run of M:
drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#     vvvv   v
#[1] "hello world"
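To make the reading of that string concrete, here is a small sketch that walks the transformation string shown above and picks out the characters of the first argument that are marked M. It assumes (consistent with the output above) that M, S and D each consume one character of the first argument while I consumes none; the variable names are mine, and the string is hard-coded from the output above rather than recomputed.

```r
x     <- "hello world"
trafo <- "MMMMDDDMDDD"   # copied from the adist() output above

ops     <- strsplit(trafo, "")[[1]]
x_chars <- strsplit(x, "")[[1]]

# Position in x reached after each operation: every op except "I" consumes one character of x
x_index <- cumsum(ops != "I")

# Characters of x that the alignment marks as matches
matched_chars <- paste(x_chars[x_index[ops == "M"]], collapse = "")
matched_chars  # "hello": all five letters match, but the final "o" is taken from "world"
```

This also shows why cutting by position in the transformation string is only safe when no I or D operations shift the indices before the matched run.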
The same behavior can be observed with [this Levenshtein tool]: it gives two possible solutions, equivalent to the two transforms below. Can we tell adist which one we prefer (the one with the greater number of consecutive M)?
#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"
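I don't know of an adist argument that chooses between equally cheap alignments. If the tie-breaking matters, one way around it is to skip adist and use the classic dynamic-programming approach to the longest common substring. This is only a sketch under that assumption, not a drop-in replacement: lcs_dp is a name I made up, it is case-sensitive, and on ties it returns the first longest run found in a.

```r
lcs_dp <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")

  # prev[j + 1] holds the length of the common suffix ending at x[i - 1] and y[j]
  prev <- integer(length(y) + 1)
  best_len <- 0L
  best_end <- 0L   # index in x where the best run ends

  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        # Extend the run that ended at x[i - 1], y[j - 1]
        cur[j + 1] <- prev[j] + 1L
        if (cur[j + 1] > best_len) {
          best_len <- cur[j + 1]
          best_end <- i
        }
      }
    }
    prev <- cur
  }

  if (best_len == 0L) return("")
  substr(a, best_end - best_len + 1L, best_end)
}

lcs_dp("hello world", "hello")        # "hello"
lcs_dp("hello world", "hello there")  # "hello "
```

It runs in time proportional to nchar(a) * nchar(b), which is fine for short strings but slow in plain R loops for long ones.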
Finally, don't forget that adist also accepts ignore.case = TRUE (FALSE is the default).
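A minimal illustration of that argument (the two strings are just an example):

```r
# Case-sensitive by default: "H" vs "h" counts as one substitution
d_case_sensitive <- drop(adist("Hello", "hello"))                      # 1

# With ignore.case = TRUE the two strings are treated as equal
d_ignore_case    <- drop(adist("Hello", "hello", ignore.case = TRUE))  # 0
```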