How does clustering (especially String clustering) work?

轻奢々 2020-12-07 17:56

I heard about clustering to group similar data. I want to know how it works in the specific case for String.

I have a table with more than 100,000 different words.

3 answers

执笔经年 (original poster)

2020-12-07 18:33
There is an R package called stringdist that allows for string comparison using several different methods. Copy-pasting from its page:
- Hamming distance: Number of positions with a different symbol in the two strings. Only defined for strings of equal length.
- Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
- (Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
- Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
- Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
- q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
- Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
- Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
- Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
- Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].
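
To make a few of these definitions concrete, here is a minimal Python sketch of the Hamming, Levenshtein, q-gram and Jaccard distances. It is not the stringdist implementation (that package is for R); the function names and the default `q = 2` are my own assumptions, chosen only for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count positions where the two strings differ (equal length only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and replacements."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Count every substring of length q (assumed default: q = 2)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the two q-gram count vectors."""
    ca, cb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus (shared q-grams / all observed q-grams)."""
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    union = sa | sb
    if not union:  # both strings shorter than q: treat as identical
        return 0.0
    return 1 - len(sa & sb) / len(union)
```

For example, `levenshtein("kitten", "sitting")` needs two replacements and one insertion, and `hamming("karolin", "kathrin")` counts three differing positions.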
That will give you the distance. You might not need to perform a cluster analysis; perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
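
This is not the script mentioned above, just a rough Python sketch of the two ideas: sorting words by their distance to a reference word, and a very simple grouping where each word joins the first group whose representative is close enough. The names `sort_by_distance` and `greedy_groups` and the `max_dist` threshold are hypothetical, and the grouping is a naive greedy pass, not a proper cluster analysis.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sort_by_distance(words, reference):
    """Order words by edit distance to `reference`, ties broken alphabetically."""
    return sorted(words, key=lambda w: (levenshtein(w, reference), w))


def greedy_groups(words, max_dist=1):
    """Put each word into the first group whose first member is within max_dist.

    The first word of a group acts as its representative, so the result
    depends on the input order (an assumption of this sketch).
    """
    groups = []
    for w in words:
        for g in groups:
            if levenshtein(w, g[0]) <= max_dist:
                g.append(w)
                break
        else:  # no existing group was close enough
            groups.append([w])
    return groups


words = ["apple", "aple", "apply", "banana", "bananna", "cherry"]
print(sort_by_distance(words, "apple"))
print(greedy_groups(words, max_dist=1))
```

With 100,000 words, comparing every pair gets expensive, so something like this would probably need the word list reduced first (for example by lowercasing and stripping punctuation) before it is practical.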