How do I group similar strings in R? [closed]

Posted by 送分小仙女 on 2019-12-21 23:56:16

Question


I have a database with ~5,000 locality names, most of which are repetitions with typos, permutations, abbreviations, etc. I would like to group them by similarity to speed up further processing. Ideally, I would convert each variation into a "platonic form" and put two columns side by side, with the original and platonic forms. I've read about multiple sequence alignment, but it seems to be used mostly in bioinformatics, for DNA/RNA/peptide sequences, and I'm not sure it will work well with place names. Does anyone know of a library that would help me do this in R? Or which of the many algorithm variations might be easiest to adapt?

EDIT: How do I do that in R? So far I'm using the adist() function, which gives me a matrix of distances between each pair of strings (although it doesn't treat translocations the way I think it should, see comment below). The next step, which I'm working on right now, is to turn this matrix into a grouping/clustering of similar-enough values. Thanks in advance!

EDIT: To solve the translocation problem, I wrote a small function that splits a string into words, removes any remaining punctuation, keeps the words with more than 2 characters, sorts them, and pastes them back into a single string.

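sep <- function(linha) {
    # Split on spaces, slashes, and hyphens
    resp <- unlist(strsplit(linha, " |/|-"))
    # Strip any remaining commas, semicolons, and periods
    resp <- gsub(",|;|\\.", "", resp)
    # Keep words longer than 2 characters, sorted alphabetically
    resp <- sort(resp[nchar(resp) > 2])
    paste(resp, collapse = " ")
}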

Then I apply this to every row of my table:

locs[,9] <- apply(locs,1,function(x) sep(x[1])) # 1=original data; 9=new data

and finally apply adist() to create the similarity table.
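
As a quick check of the idea, here is a minimal sketch of how sorting the words affects the distances. The locality names are made up, and normalize_words is a hypothetical helper that mirrors sep() so the snippet runs on its own:

```r
# Hypothetical helper mirroring sep(): split, strip punctuation,
# keep words longer than 2 characters, sort, and re-join.
normalize_words <- function(x) {
    words <- unlist(strsplit(x, " |/|-"))
    words <- gsub(",|;|\\.", "", words)
    words <- sort(words[nchar(words) > 2])
    paste(words, collapse = " ")
}

# Made-up locality names, for illustration only
raw  <- c("Santa Cruz, Rio", "Rio Santa Cruz", "Porto Alegre")
norm <- vapply(raw, normalize_words, character(1), USE.NAMES = FALSE)

d_raw  <- adist(raw)   # distances on the original strings
d_norm <- adist(norm)  # distances after sorting the words

# The first two names differ only in word order and punctuation,
# so they normalize to the same string and their distance drops to 0.
d_norm[1, 2]
```

The trade-off is that words of 1 or 2 characters are dropped entirely, so names that differ only in a short word will collapse into the same string.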


Answer 1:


There's a built-in function called adist() that computes a measure of distance between two strings (a generalized Levenshtein edit distance).

It's like agrep(), except that it returns the distance instead of whether the strings match according to some approximate-matching criterion.
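
To make the difference concrete, here is a small sketch on made-up place names. The max.distance value of 1 is just an assumption for the example:

```r
# Made-up sample, for illustration only
places <- c("Sao Paulo", "Sao Paolo", "Rio de Janeiro")

# agrep() returns the indices of the elements that approximately
# match the pattern (here allowing at most 1 edit)
hits <- agrep("Sao Paulo", places, max.distance = 1)

# adist() returns the edit distances themselves, as a matrix with
# one row per string in the first argument and one column per
# string in the second
d <- adist("Sao Paulo", places)
```

So agrep() answers "which of these are close enough?", while adist() gives you the numbers to set your own threshold or feed into a clustering step.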

For the special case of words that can be swapped around a comma (e.g. "hello,world" should be close to "world,hello"), here's a quick hack. You can modify the function pretty easily if you have other special cases.

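adist_special <- function(word1, word2) {
    # Also try word2 with the parts around the comma swapped
    swapped <- gsub("(.*),(.*)", "\\2,\\1", word2)
    # Take the smaller of the plain and the swapped distance
    min(adist(word1, word2), adist(word1, swapped))
}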

adist("hello,world", "world,hello")
# 8

adist_special("hello,world", "world,hello")
# 0
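
To get from the distance matrix to actual groups, one option (not the only one) is hierarchical clustering with hclust() and cutree(). The following is a rough sketch on made-up names. The average linkage, the cut height of 4, and the choice of the most frequent spelling as the "platonic form" are all assumptions you would need to tune for real data:

```r
# Made-up locality names with typos and repeats, for illustration only
locs <- c("Sao Paulo", "Sao Paulo", "Sao Paolo", "S Paulo",
          "Rio de Janeiro", "Rio de Janeiro", "Rio de Janero",
          "Brasilia")

d  <- adist(locs)                              # pairwise edit distances
hc <- hclust(as.dist(d), method = "average")   # hierarchical clustering
grp <- cutree(hc, h = 4)                       # cut height 4 is arbitrary

# Use the most frequent spelling in each group as its "platonic form"
# (ties fall to whichever spelling table() lists first)
platonic <- ave(locs, grp,
                FUN = function(v) names(which.max(table(v))))

# Original and platonic forms side by side
result <- data.frame(original = locs, group = grp, platonic = platonic)
```

With ~5,000 names the full distance matrix is still manageable (about 25 million entries), but picking the cut height usually takes some trial and error, e.g. by inspecting plot(hc).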


Source: https://stackoverflow.com/questions/19961076/how-do-i-group-similar-strings-in-r
