Text clustering with Levenshtein distances

暖寄归人 2020-11-30 22:52

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clusteri

4 Answers
  •  执念已碎
    2020-11-30 23:36

    This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

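    set.seed(1)
    # Return a vector of n random lowercase strings, each k characters long
    rstr <- function(n, k) {
      sapply(seq_len(n), function(i) paste(sample(letters, k, replace = TRUE), collapse = ""))
    }
    
    # 30 strings of length 5 in three groups, prefixed "aa", "bb", and "cc"
    strs <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
    # Levenshtein distance matrix (adist() is in the utils package)
    d <- adist(strs)
    rownames(d) <- strs
    # Hierarchical clustering; hclust() defaults to complete linkage
    hc <- hclust(as.dist(d))
    plot(hc)
    rect.hclust(hc, k = 3)   # outline the three clusters on the dendrogram
    # Attach each string's cluster ID
    df <- data.frame(str = strs, cluster = cutree(hc, k = 3))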
    

    In this example, we artificially create a set of 30 random five-character strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and append the cluster IDs to the original strings.
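
    With real data the number of groups is usually not known in advance. One possible variation, sketched here under assumptions that are not part of the answer above, is to cut the dendrogram at a distance threshold with cutree(hc, h = ...) instead of fixing k. The toy strings, the choice of average linkage, and the threshold of 2.5 are all illustrative and would need tuning for an actual data set.

```r
# Hypothetical toy data, chosen only for illustration
words <- c("aaxyz", "aaxyw", "aaqyz", "bbxyz", "bbxyw", "bbqyz")

# Pairwise Levenshtein distances, labelled by the strings themselves
d <- adist(words)
rownames(d) <- words

# Average linkage is an assumption here; hclust() defaults to "complete"
hc <- hclust(as.dist(d), method = "average")

# Cut at a height (average edit distance) instead of a fixed number of clusters;
# the threshold 2.5 is an arbitrary illustrative value
clusters <- cutree(hc, h = 2.5)

# List the members of each resulting cluster
groups <- split(words, clusters)
print(groups)
```

    Lowering the threshold produces more, tighter clusters, and raising it merges them, so inspecting plot(hc) first is a reasonable way to pick a value.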
