Similar String algorithm

梦如初夏 2020-11-29 20:10

I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...

Much like the question posed he

9 answers

Happy的楠姐 (original poster)

2020-11-29 20:27
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way many compression algorithms (e.g. the LZ77 stage of gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
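To make the (offset, length) idea concrete, here is a toy sketch of that scanning step. It is not how gzip is actually implemented (DEFLATE uses a bounded 32 KB window, hash-based match search and Huffman coding on top); the names `lz77_tokens`, `lz77_expand` and the `min_match` parameter are made up for this illustration.

```python
from typing import List, Tuple, Union

# A token is either a literal character or an (offset, length) back-reference.
Token = Union[str, Tuple[int, int]]

def lz77_tokens(s: str, min_match: int = 3) -> List[Token]:
    """Greedy toy LZ77: emit literals, or (offset, length) pairs for repeats."""
    tokens: List[Token] = []
    i = 0
    while i < len(s):
        best_len, best_off = 0, 0
        # Brute force: try every earlier start position, keep the longest match.
        for j in range(i):
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(s[i])
            i += 1
    return tokens

def lz77_expand(tokens: List[Token]) -> str:
    """Inverse of lz77_tokens: rebuild the original string."""
    out: List[str] = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length = tok
            # Copy one character at a time so overlapping matches work.
            for _ in range(length):
                out.append(out[-off])
        else:
            out.append(tok)
    return "".join(out)

print(lz77_tokens("abcabcabc"))
```

The more of a string that can be expressed as back-references to earlier text, the fewer tokens are needed, which is exactly the effect the similarity score exploits.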

Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
```
len(comp(s)) + len(comp(t)) - len(comp(s . t))
```
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score (close to len(comp(t))), while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
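As a minimal sketch of this score, assuming Python's standard `zlib` module stands in for comp and strings are encoded as UTF-8 before compressing (the helper names `comp_len` and `similarity` are just for illustration):

```python
import zlib

def comp_len(s: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of s."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def similarity(s: str, t: str) -> int:
    """len(comp(s)) + len(comp(t)) - len(comp(s . t))"""
    return comp_len(s) + comp_len(t) - comp_len(s + t)

if __name__ == "__main__":
    s = "find similar text in two or more different strings"
    t = "in two or more different strings, find similar text"
    u = "9482 7716 0358 2291 6640 1873 5529 3047 8812 4465"
    print("s vs t:", similarity(s, t))  # reordered words: expected to score higher
    print("s vs u:", similarity(s, u))  # unrelated text: expected to score lower
```

One thing to keep in mind: every zlib stream carries a few bytes of fixed overhead (header and checksum), so for very short strings the scores are noisy and are best compared against each other rather than read as absolute values.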

Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
```
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
```
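This technique has its roots in information theory: it is closely related to the normalized compression distance, which uses a real compressor as a practical stand-in for Kolmogorov complexity.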
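The symmetric formula can be sketched the same way, again assuming `zlib` as the compressor and UTF-8 encoding; `comp_len` and `symmetric_similarity` are illustrative names, not part of any library:

```python
import zlib

def comp_len(s: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of s."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def symmetric_similarity(s: str, t: str) -> int:
    """2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))"""
    return (2 * (comp_len(s) + comp_len(t))
            - comp_len(s + t)
            - comp_len(t + s))

if __name__ == "__main__":
    s = "find similar text in two or more different strings"
    t = "in two or more different strings, find similar text"
    # Swapping the arguments gives the same value by construction.
    print(symmetric_similarity(s, t), symmetric_similarity(t, s))
```

This is simply the one-directional score computed in both orders and added together, which is why the argument order no longer matters.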

Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).