Is there a method to calculate something like a general "similarity score" for a string? In a way that I am not comparing two strings together, but rather get some number (hash) for each string, such that similar strings get similar numbers?
The Levenshtein distance, or one of its derivatives, is the algorithm you want: match the given string against each string in the dictionary. (If you only need a fixed number of the most similar strings, a bounded heap keeps the best matches cheaply; see the sketch below.) If running the Levenshtein distance against every string in the dictionary is too expensive, first use a rough filter that excludes obviously distant words from the candidate list, then run the Levenshtein distance on the remaining candidates.
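For illustration, here is a minimal Python sketch of this brute-force step; the function names are mine, not from any particular library:

```python
import heapq

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)) time,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def k_most_similar(query: str, dictionary: list[str], k: int = 5) -> list[tuple[int, str]]:
    # heapq.nsmallest maintains a bounded heap internally, so we never
    # sort the whole dictionary, only track the k best candidates.
    return heapq.nsmallest(k, ((levenshtein(query, w), w) for w in dictionary))

print(k_most_similar("Helo wold", ["Hello world", "FooBarbar", "Foo world!"], k=2))
```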
One way to exclude distant words is to index n-grams: preprocess the dictionary by splitting each word into a list of n-grams (a sketch of this step follows the examples). For example, with n = 3:
(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]
Next, create an index of n-grams:
" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]
When you need to find the most similar strings for a given string, split it into n-grams and select only those dictionary words that share at least one n-gram with it. This reduces the number of candidates to a reasonable amount, and you can then Levenshtein-match the given string against each remaining candidate, as in the sketch below.
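A minimal sketch of the whole filtering pipeline, assuming trigram splitting as above; all names are illustrative:

```python
from collections import defaultdict

def ngrams(s: str, n: int = 3) -> list[str]:
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_index(dictionary: list[str], n: int = 3) -> dict[str, set[int]]:
    # Map each n-gram to the set of word ids that contain it.
    index = defaultdict(set)
    for word_id, word in enumerate(dictionary):
        for gram in ngrams(word, n):
            index[gram].add(word_id)
    return index

def candidates(query: str, dictionary: list[str], index: dict[str, set[int]], n: int = 3) -> list[str]:
    # Union of posting lists: every word sharing at least one n-gram.
    ids = set()
    for gram in ngrams(query, n):
        ids |= index.get(gram, set())
    return [dictionary[i] for i in ids]

words = ["Hello world", "FooBarbar", "Foo world!"]
idx = build_index(words)
print(candidates("Helo wold", words, idx))  # "FooBarbar" is filtered out
```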
If your strings are long enough, you can reduce the index size with the min-hashing technique: calculate an ordinary hash for each n-gram and keep only the K smallest hashes, throwing the others away.
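A hedged sketch of that idea: hash every n-gram and keep only the K smallest hash values as the word's signature. I use a stable hash from hashlib here because Python's built-in `hash` is salted per process; the `minhash_signature` name and K = 4 are my choices for illustration.

```python
import hashlib

def minhash_signature(s: str, n: int = 3, k: int = 4) -> list[int]:
    def stable_hash(gram: str) -> int:
        # First 8 bytes of an MD5 digest as an integer: stable across runs.
        return int.from_bytes(hashlib.md5(gram.encode()).digest()[:8], "big")
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return sorted(stable_hash(g) for g in grams)[:k]

# Similar strings share most n-grams, so their K smallest hashes overlap.
print(minhash_signature("Hello world"))
print(minhash_signature("Hello, world"))
```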
P.S. This presentation seems like a good introduction to your problem.