String matching to estimate similarity

I want to analyse a text field 100 characters long and estimate a similarity percentage between responses. For example, for the same question, "What's your opinion on smartphones?", the answers are:

Person A: "Best way to waste money"

Person B: "Amazing stuff. lets you stay connected all the time"

Person C: "Instrument to waste money and time"

Out of these, just by matching individual words, A and C sound similar. I am trying to do something like this in R to start with, and later extend it to match combinations of words such as "Best", "Best way", "Best way waste", etc. I am new to text analysis and R, and I could not find the proper names of these methods to search effectively.

Please guide me with your inputs and references. Thanks in advance.

Here is a potential solution for manually looking at percent similarity.

a <- "Best way to waste money"
b <- "Amazing stuff. lets you stay connected all the time"
c <- "Instrument to waste money and time"

# Normalise a string: drop the information that presumably isn't important
# (capital letters, punctuation), then split it into separate words.
# Note: defining this in the workspace masks base::format().
format <- function(string1){
  lower <- tolower(string1)
  no.punct <- gsub("[[:punct:]]", "", lower)
  split <- strsplit(no.punct, split=" ")
  return(split)
}

a <- format(a)
b <- format(b)
c <- format(c)

# How similar is string 1 to string 2: the fraction of words in str1 that also occur in str2.
# NOTE: the order is important, i.e. sim.per(b, c) is different from sim.per(c, b).
sim.per <- function(str1, str2, ...){
  sim <- length(intersect(str1[[1]], str2[[1]])) # intersect() returns the distinct common words; length() counts them
  total <- length(str1[[1]])
  per <- sim/total
  return(per)
}

#test
sim.per(b, c)
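If a measure that does not depend on argument order is preferred, one common option is the Jaccard index: the number of distinct shared words divided by the number of distinct words in either answer. The sketch below is only an illustration, not part of the answer's method. The helper names `tokens_of` and `jaccard` are made up for this example, and it splits on runs of whitespace rather than single spaces.

```r
# Hypothetical helper names; an illustrative sketch, not the answer's own code.
answers <- c(A = "Best way to waste money",
             B = "Amazing stuff. lets you stay connected all the time",
             C = "Instrument to waste money and time")

# Lower-case, strip punctuation, split on runs of whitespace, keep distinct words.
tokens_of <- function(x) {
  cleaned <- gsub("[[:punct:]]", "", tolower(x))
  unique(strsplit(trimws(cleaned), "\\s+")[[1]])
}

# Jaccard similarity: shared distinct words / all distinct words in either answer.
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

token_list <- lapply(answers, tokens_of)

# Pairwise similarity matrix (symmetric, with 1 on the diagonal).
idx <- seq_along(token_list)
sim_matrix <- outer(idx, idx,
                    Vectorize(function(i, j) jaccard(token_list[[i]], token_list[[j]])))
dimnames(sim_matrix) <- list(names(answers), names(answers))

round(sim_matrix, 2)
```

With these three answers, A and C share "to", "waste" and "money" out of 8 distinct words, so their score is 0.375, while A and B share nothing and score 0.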

I hope that helps! Searching for combinations of words would take some more work. Try editing the question to show exactly what you're looking for, and you might have more luck getting an answer.

As for references, check out "Handling and Processing Strings in R" by Gaston Sanchez; it's great.
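Combinations of adjacent words are usually called n-grams (bigrams for pairs, trigrams for triples), which may be a useful search term. A minimal sketch of matching on them follows, assuming that word combinations means runs of adjacent words; the helper names `words_of`, `ngrams` and `ngram_overlap` are hypothetical. A combination such as "Best way waste" skips the word "to", so it would only appear if such filler words were removed before building the n-grams, which this sketch does not do.

```r
# Hypothetical helper names; a sketch, not the answer's own method.
# Lower-case, strip punctuation and split into words.
words_of <- function(x) {
  cleaned <- gsub("[[:punct:]]", "", tolower(x))
  strsplit(trimws(cleaned), "\\s+")[[1]]
}

# All runs of n adjacent words, pasted together with single spaces.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Share of the distinct n-grams in x that also occur in y.
# As with sim.per, the order of the arguments matters.
ngram_overlap <- function(x, y, n) {
  gx <- unique(ngrams(words_of(x), n))
  gy <- unique(ngrams(words_of(y), n))
  if (length(gx) == 0) return(NA_real_)
  length(intersect(gx, gy)) / length(gx)
}

a_txt <- "Best way to waste money"
c_txt <- "Instrument to waste money and time"

ngrams(words_of(a_txt), 2)      # "best way" "way to" "to waste" "waste money"
ngram_overlap(a_txt, c_txt, 2)  # 0.5: "to waste" and "waste money" are shared
```

Longer n-grams are stricter: with `n = 3` only "to waste money" is shared, so the overlap for A against C drops to one third.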

Source: https://stackoverflow.com/questions/22936951/string-matching-to-estimate-similarity

Tags: string, text-mining, text-analysis