String similarity score/hash

前端 未结 12 1226
长发绾君心
长发绾君心 2020-12-07 09:52

Is there a method to calculate something like general \"similarity score\" of a string? In a way that I am not comparing two strings together but rather I get some number (h

12条回答
  •  时光说笑
    2020-12-07 10:34

    In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.

    Imagine similarity at the character level

    stops
    spots
    
    hello world
    world hello
    

    In both examples the messages are different, but the characters in the message are identical, so the measure would need to hold a position value , as well as a character value. (char 0 == 'h', char 1 == 'e' ...)

    Then compare the following similar messages

    hello world
    ello world
    

    Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.

    In the case of

    spots
    stops
    

    The words only differ by position of the characters, so some form of position is important.

    If the following strings are similar

     yesssssssssssssss
     yessssssssssssss
    

    Then you have a form of paradox. If you add 2 s characters to the second string, it should share the distance it was from the first string, but it should be distinct. This can be repeated getting progressively longer strings, all of which need to be close to the strings just shorter and longer than them. I can't see how to achieve this.

    In general this is treated as a multi-dimensional problem - breaking the string into a vector

    [ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]
    

    But the values of the vector can not be

    • represented by a fixed size number, or
    • give good quality difference measure.

    If the number of words, or length of strings were bounded, then a solution of coding may be possible.

    Bounded values

    Using something like arithmetic compression, then a sequence of words can be converted into a floating point number which represents the sequence. However this would treat items earlier in the sequence as more significant than the last item in the sequence.

    data mining solution

    If you accept that the problem is high dimensional, then you can store your strings in a metric-tree wikipedia : metric tree. This would limit your search space, whilst not solving your "single number" solution.

    I have code for such at github : clustering

    Items which are close together, should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.

    Edit Distance or Levenshtein distance

    This is used in a sqlite extension to perform similarity searching, but with no single number solution, it works out how many edits change one string into another. This then results in a score, which shows similarity.

提交回复
热议问题