String matching to estimate similarity

爷,独闯天下 submitted on 2019-12-04 16:55:10

Here is one possible way to compute percent similarity by hand: take the share of words in one string that also appear in the other.

a <- "Best way to waste money"
b <- "Amazing stuff. lets you stay connected all the time"
c <- "Instrument to waste money and time"

# Normalise a string before comparing: lower-case it, strip punctuation,
# then split it into individual words. Named clean.words rather than
# format so that it does not mask base::format().
clean.words <- function(string1){
  lower <- tolower(string1)
  no.punct <- gsub("[[:punct:]]", "", lower)
  split <- strsplit(no.punct, split=" ")
  return(split) # a list holding one character vector of words
}

a <- clean.words(a)
b <- clean.words(b)
c <- clean.words(c)

# Share of the words in str1 that also occur in str2.
# NOTE: the order matters, i.e. sim.per(b, c) is different from sim.per(c, b),
# because the result is divided by the number of words in str1.
sim.per <- function(str1, str2, ...){
  sim <- length(intersect(str1[[1]], str2[[1]])) # intersect() returns the unique words found in both
  total <- length(str1[[1]])
  per <- sim/total
  return(per)
}

# test: 1 of the 9 words in b ("time") also occurs in c
sim.per(b, c) # 0.1111111
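
Since the score above depends on argument order, a symmetric variant may sometimes be handier. The following is only a sketch of one common choice, the Jaccard index (shared unique words divided by all unique words in either string); it is not part of the approach above, and the names tokenize, jaccard, s1, s2 and s3 are made up for illustration.

```r
# Hypothetical helper (an assumption, not from the answer above): lower-case,
# strip punctuation, split on whitespace, and return a character vector of words.
tokenize <- function(x) {
  cleaned <- gsub("[[:punct:]]", "", tolower(x))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words[nzchar(words)]  # drop empty strings left by leading whitespace
}

# Jaccard index: unique words shared by both strings / unique words in either.
jaccard <- function(x, y) {
  wx <- unique(tokenize(x))
  wy <- unique(tokenize(y))
  n_union <- length(union(wx, wy))
  if (n_union == 0) return(NA_real_)  # nothing left to compare after cleaning
  length(intersect(wx, wy)) / n_union
}

s1 <- "Best way to waste money"
s2 <- "Amazing stuff. lets you stay connected all the time"
s3 <- "Instrument to waste money and time"

jaccard(s1, s3)  # 3 shared words out of 8 distinct words: 0.375
jaccard(s3, s1)  # same value, the argument order no longer matters
jaccard(s2, s3)  # 1 shared word out of 14 distinct words
```

Whether a symmetric or a directional score fits better depends on the question being asked: sim.per answers "how much of this string is covered by that one", while the Jaccard index answers "how much do the two overlap overall".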

I hope that helps! Matching combinations of words rather than single words would take some more work, for example comparing runs of adjacent words instead of individual ones. If you edit the question to show exactly what you're looking for, you might have more luck getting an answer!
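
One possible reading of "combinations of words" is pairs of adjacent words (bigrams). Under that assumption, a rough sketch could reuse the same intersect idea on word pairs instead of single words. The names tokenize, bigrams and bigram.sim are hypothetical, and this is only one guess at what the question is after.

```r
# Hypothetical helper (an assumption): lower-case, strip punctuation,
# split on whitespace, and return a character vector of words.
tokenize <- function(x) {
  cleaned <- gsub("[[:punct:]]", "", tolower(x))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words[nzchar(words)]
}

# Build adjacent word pairs, e.g. c("to", "waste", "money") gives
# "to waste" and "waste money".
bigrams <- function(words) {
  n <- length(words)
  if (n < 2) return(character(0))
  paste(words[-n], words[-1])
}

# Share of the distinct bigrams in x that also occur in y.
# As with sim.per, the order of the arguments matters.
bigram.sim <- function(x, y) {
  bg.x <- unique(bigrams(tokenize(x)))
  bg.y <- unique(bigrams(tokenize(y)))
  if (length(bg.x) == 0) return(NA_real_)  # x has fewer than two words
  length(intersect(bg.x, bg.y)) / length(bg.x)
}

s1 <- "Best way to waste money"
s3 <- "Instrument to waste money and time"

bigram.sim(s1, s3)  # "to waste" and "waste money" match: 2 of 4 = 0.5
bigram.sim(s3, s1)  # the same 2 matches out of 5 bigrams = 0.4
```

Word pairs are stricter than single words: two strings only score when they share words in the same order, which may or may not be what is wanted here.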

As for references, check out "Handling and Processing Strings in R" by Gaston Sanchez; it's great.
