Question
I am trying to compare strings like PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. The intention is to check whether all the characters of the shorter string exist in the other string. If they do, I should get a 100% match; otherwise, I should get the percentage of characters that matched.
I tried levenshteinSim from the RecordLinkage package, but it returns a similarity score based on the number of edits needed to turn one string into the other, so strings with reordered words score low:
install.packages("RecordLinkage")
require(RecordLinkage)
levenshteinSim("PRABHAKAR SHARMA","SHARMA KUMAR PRABHAKAR")
#[1] 0.3636364
I want a 100% match in such a case. Also, this has to be replicated for over 1,000,000 records.
Answer 1:
Here is one approach, which compares the sets of distinct characters in the two strings:
s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"
compare <- function(s1, s2) {
    c1 <- unique(strsplit(s1, "")[[1]])
    c2 <- unique(strsplit(s2, "")[[1]])
    length(intersect(c1, c2)) / length(c1)
}
compare(s1, s2)
#[1] 1
It may be a little slow on large inputs, though, and it treats the space as a character too. The result is relative to the first argument, so pass the shorter string as s1. Use Vectorize to apply it to a column:
dat <- data.frame(small=c("a", "b"), big=c("aa", "cc"), stringsAsFactors=FALSE)
vcomp <- Vectorize(compare)
dat <- transform(dat, comp=vcomp(small, big))
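If spaces should be ignored and the shorter string is not always in the same column, the function above can be adapted. The following is only a sketch under those assumptions; char_overlap is a hypothetical name, not part of the original answer, and mapply is swapped in for Vectorize (which wraps mapply anyway):

```r
# Hypothetical helper: fraction of the shorter string's distinct non-space
# characters that also occur in the longer string.
char_overlap <- function(a, b) {
  ca <- unique(strsplit(gsub(" ", "", a, fixed = TRUE), "")[[1]])
  cb <- unique(strsplit(gsub(" ", "", b, fixed = TRUE), "")[[1]])
  # Use whichever string is shorter as the reference set
  if (nchar(a) > nchar(b)) {
    tmp <- ca
    ca <- cb
    cb <- tmp
  }
  # Nothing to compare if the reference has no non-space characters
  if (length(ca) == 0) return(NA_real_)
  length(intersect(ca, cb)) / length(ca)
}

dat <- data.frame(small = c("PRABHAKAR SHARMA", "ab"),
                  big = c("SHARMA KUMAR PRABHAKAR", "ac"),
                  stringsAsFactors = FALSE)
dat$comp <- mapply(char_overlap, dat$small, dat$big, USE.NAMES = FALSE)
```

This still loops over the rows one at a time in R, so on a million records it is unlikely to be fast; it only changes what is counted, not how quickly.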
Answer 2:
If the only characters to be considered are letters, you could use:
comp <- function(s1, s2) {
    in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
    in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
    sum(in1 & in2) / sum(in1)
}
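As a quick check, the function can be run on the strings from the question. The block below repeats the definition so it runs on its own; the extra inputs ("abc", "abd", "123") are made up for illustration:

```r
comp <- function(s1, s2) {
  # Logical vectors over a-z: which letters occur in each string
  in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
  in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
  # Share of s1's distinct letters that also occur in s2
  sum(in1 & in2) / sum(in1)
}

r_match   <- comp("PRABHAKAR SHARMA", "SHARMA KUMAR PRABHAKAR")  # 1
r_reverse <- comp("SHARMA KUMAR PRABHAKAR", "PRABHAKAR SHARMA")  # 8/9, "u" is missing
r_partial <- comp("abc", "abd")                                  # 2/3
r_empty   <- comp("123", "abc")                                  # NaN, s1 has no letters
```

The result depends on the argument order, because the denominator counts the letters of s1 only. Spaces and digits are ignored, and an s1 with no letters at all gives NaN (0/0), which may need handling before the function is applied to real data.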
Source: https://stackoverflow.com/questions/36085290/check-if-all-characters-of-one-string-exist-in-another-string-in-r