How does clustering (especially String clustering) work?

轻奢々 2020-12-07 17:56

I have heard that clustering can be used to group similar data. I want to know how it works in the specific case of strings.

I have a table with more than 100,000 different words.

3 answers
  •  执笔经年
    2020-12-07 18:33

    There is an R package called stringdist that compares strings using several different methods. Quoting from that page:

    • Hamming distance: Number of positions at which the two strings have different symbols. Only defined for strings of equal length.
    • Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
    • (Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
    • Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
    • Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
    • q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
    • Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
    • Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
    • Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
    • Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].
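
To make a few of these definitions concrete, here is a minimal sketch in plain Python. It is not the stringdist implementation (that package is for R); the function names and the default q = 2 are assumptions chosen for illustration, and only the Hamming, Levenshtein, q-gram and Jaccard distances are covered.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which the two strings differ (equal length only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace (free if symbols match)
            ))
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Count every substring of length q (the N-gram vector of s)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the q-gram counts of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ga[g] - gb[g]) for g in ga.keys() | gb.keys())


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus (shared q-grams / all observed q-grams)."""
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    if not sa and not sb:
        return 0.0  # assumption: two strings with no q-grams count as identical
    return 1 - len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
    print(qgram_distance("abc", "abd"))
    print(jaccard_distance("abc", "abd"))
```

The edit-based distances suit typos and small spelling variations, while the q-gram based ones are cheaper to compute and less sensitive to where in the string a difference occurs.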

    That will give you the distance. You might not need to perform a cluster analysis; sorting by the string distance itself may be sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
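
As a rough illustration of both options (sorting by distance, and an actual grouping step), here is a small self-contained Python sketch. It is not the script mentioned above; `sort_by_distance`, `cluster_by_threshold` and the `max_dist` cutoff are hypothetical names, and the grouping is a simple single-linkage scheme picked as an assumption, not necessarily what a clustering package would do.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sort_by_distance(words, reference):
    """Order words by edit distance to a reference word (ties broken alphabetically)."""
    return sorted(words, key=lambda w: (levenshtein(w, reference), w))


def cluster_by_threshold(words, max_dist=1):
    """Single-linkage grouping: two words share a group if a chain of words,
    each within max_dist edits of the next, connects them."""
    parent = list(range(len(words)))  # union-find forest over word indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if levenshtein(words[i], words[j]) <= max_dist:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, w in enumerate(words):
        groups.setdefault(find(i), []).append(w)
    return list(groups.values())


if __name__ == "__main__":
    print(sort_by_distance(["cot", "dog", "cat", "cart"], "cat"))
    print(cluster_by_threshold(["cat", "cot", "dog", "dig", "bird"], max_dist=1))
```

Note that the grouping compares every pair of words, so the work grows quadratically. For a table of 100,000 words that is about 5 billion comparisons, so in practice you would first narrow the candidates, for example by only comparing words that share a prefix or a q-gram.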
