String similarity score/hash

Is there a method to calculate something like a general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (h

12 Answers

悲&欢浪女

2020-12-07 10:31
I am thinking of something like this:
1. Remove all non-word characters.
2. Apply Soundex.
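
The two steps above can be sketched as follows. This is a minimal sketch of American Soundex, not code from the answer; the names `soundex` and `phrase_key` are hypothetical, and applying Soundex word by word is only one possible reading of the suggestion.

```python
import re

# Soundex digit for each consonant; vowels, h, w and y have no code.
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Return the 4-character American Soundex code of a single word."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def phrase_key(text):
    """Step 1: keep only runs of letters. Step 2: Soundex each word."""
    words = re.findall(r"[A-Za-z]+", text)
    return " ".join(soundex(w) for w in words)


print(phrase_key("Hello, world!"))  # H400 W643
```

Words that sound alike, such as "Robert" and "Rupert", get the same code, so the key groups similar-sounding strings, but it says nothing about how far apart two different codes are.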

遥遥无期

2020-12-07 10:33

Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.

Just an idea.

ready-to-run PCA in C#

时光说笑

2020-12-07 10:34
For an unbounded problem, no function can convert every possible sequence of words, or every possible sequence of characters, into a single number that describes locality (that is, one where similar strings get nearby numbers).

Imagine similarity at the character level:
```
stops
spots

hello world
world hello
```
In both examples the messages are different, but the characters in the messages are identical, so the measure would need to hold a position value as well as a character value (char 0 == 'h', char 1 == 'e', ...).
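
To make the point concrete, here is a small sketch (the helper names `char_counts` and `char_positions` are hypothetical) showing that a measure built only from character counts cannot tell these pairs apart, while one that also records position can:

```python
from collections import Counter


def char_counts(s):
    """Character values only: how often each character occurs."""
    return Counter(s)


def char_positions(s):
    """Character values together with their positions."""
    return list(enumerate(s))


for a, b in [("stops", "spots"), ("hello world", "world hello")]:
    same_counts = char_counts(a) == char_counts(b)
    same_positions = char_positions(a) == char_positions(b)
    print(a, "|", b, "| same counts:", same_counts, "| same positions:", same_positions)
```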

Then compare the following similar messages:
```
hello world
ello world
```
Although the two strings are similar, similar strings can differ at the beginning or at the end, which makes scaling by position problematic.
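
A quick way to see why position alone is fragile: compare the strings position by position. The function name `positional_matches` is hypothetical; this is only an illustration, not a proposed measure.

```python
def positional_matches(a, b):
    """Count the positions at which both strings hold the same character."""
    return sum(x == y for x, y in zip(a, b))


# Dropping the first character shifts everything, so almost nothing lines up.
print(positional_matches("hello world", "ello world"))  # 1
# Dropping the last character leaves every remaining position aligned.
print(positional_matches("hello world", "hello worl"))  # 10
```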

In the case of
```
spots
stops
```
The words only differ by position of the characters, so some form of position is important.

If the following strings are similar
```
 yesssssssssssssss
 yessssssssssssss
```
Then you have a form of paradox. If you add two s characters to the second string, the result should be the same distance from the first string as the second string was, yet it must be distinct from the second string. This can be repeated to get progressively longer strings, each of which needs to be close to the strings just shorter and just longer than it. I can't see how to achieve this.

In general this is treated as a multi-dimensional problem, breaking the string into a vector:
```
[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]
```
But the values of the vector cannot
- be represented by a fixed-size number, or
- give a good-quality difference measure.
If the number of words, or the length of the strings, were bounded, then a coding solution might be possible.

Bounded values

Using something like arithmetic coding, a sequence of words can be converted into a floating-point number which represents the sequence. However, this treats items earlier in the sequence as more significant than items later in the sequence.
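
A minimal sketch of this idea, assuming a small fixed vocabulary in which every word is equally likely (the simplest special case of arithmetic coding, where each word narrows the interval by the same factor). The function name `sequence_to_number` and the vocabulary are hypothetical.

```python
from fractions import Fraction


def sequence_to_number(words, vocabulary):
    """Map a word sequence to a number in [0, 1).

    Each word selects one of len(vocabulary) equal sub-intervals of the
    interval chosen by the words before it.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    base = len(vocabulary)
    value = Fraction(0)
    scale = Fraction(1)
    for word in words:
        scale /= base
        value += index[word] * scale
    return value


vocab = ["a", "b", "c", "d"]
x = sequence_to_number(["a", "b", "c"], vocab)
last_changed = sequence_to_number(["a", "b", "d"], vocab)
first_changed = sequence_to_number(["b", "b", "c"], vocab)
print(abs(last_changed - x))   # 1/64
print(abs(first_changed - x))  # 1/4
```

Changing the last word moves the number by 1/64, while changing the first word moves it by 1/4, which is the bias toward early items described above.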

data mining solution

If you accept that the problem is high-dimensional, then you can store your strings in a metric tree (Wikipedia: metric tree). This would limit your search space, while not giving you a "single number" solution.

I have code for this on GitHub: clustering

Items which are close together should be stored together in one part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.

Edit Distance or Levenshtein distance

This is used in an SQLite extension to perform similarity searching. It does not give a single number per string; instead, it works out how many edits are needed to change one string into another. The result is a score which shows how similar the two strings are.
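
As a rough illustration, the edit distance can be computed with the standard dynamic-programming recurrence. Because it is a true metric, it can also serve as the distance function of a metric tree; the sketch below uses a BK-tree, which is one simple kind of metric tree and is not the code linked in this answer or the SQLite extension. The names `levenshtein` and `BKTree` are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


class BKTree:
    """Metric tree for integer-valued distances. A node is (item, children),
    where children maps a distance d to the subtree of items at distance d."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, item) pairs with distance <= radius."""
        found = []
        stack = [self.root] if self.root else []
        while stack:
            item, children = stack.pop()
            d = self.dist(query, item)
            if d <= radius:
                found.append((d, item))
            # Triangle inequality: only subtrees whose edge distance lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(found)


tree = BKTree(levenshtein)
for s in ["stops", "spots", "hello world", "ello world", "world hello"]:
    tree.add(s)

print(levenshtein("stops", "spots"))   # 2
print(tree.search("hello world", 1))   # [(0, 'hello world'), (1, 'ello world')]
```

The tree answers "which stored strings are within k edits of this one" without comparing the query against every string, but each lookup still needs a query string to compare against.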

庸人自扰

2020-12-07 10:37

Would Levenshtein distance work for you?

时光取名叫无心

2020-12-07 10:46

Your idea sounds like an ontology, but applied to whole phrases. The more similar two phrases are, the closer they are in the graph (assuming you're using weighted edges). And vice versa: dissimilar phrases are very far from each other.

Another approach is to use a Fourier transform to get a sort of 'index' for a given string (it won't be a single number, though). You may find a little more in this paper.

Another idea, based on the Levenshtein distance: you can compare n-grams, which gives you a similarity index for two given phrases; the more similar they are, the closer the value is to 1. This may be used to calculate distance in the graph. I wrote a paper on this a few years ago; if you'd like, I can share it.
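
One common way to turn n-gram overlap into a value between 0 and 1 is the Dice coefficient over character bigrams. The sketch below assumes that choice, which may differ from the measure in the paper mentioned here; the names `ngrams` and `dice_similarity` are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Multiset of the character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient of the n-gram multisets: 1.0 means identical
    n-grams, 0.0 means no n-gram in common."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than n, so there is nothing to compare.
        return 1.0 if a == b else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


print(dice_similarity("night", "nacht"))  # 0.25
print(dice_similarity("hello world", "hello world"))  # 1.0
```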

Anyway, although I don't know the exact solution, I'm also interested in what you'll come up with.

醉酒成梦

2020-12-07 10:46

It is unlikely that one can derive a fairly small number from each of two phrases such that comparing the numbers gives a relevant indication of how similar the original phrases are.
One reason is that a number gives an indication in one dimension, while phrases vary in two dimensions: length and intensity.

The number could also vary in length as well as in intensity, but I'm not sure that would help much.

In two dimensions, you would do better to look at a matrix, some of whose properties, such as the determinant (a kind of derivative of the matrix), could give a rough idea of the phrase's trend.
