Is there an edit distance algorithm that takes “chunk transposition” into account?

被撕碎了的回忆 · 2021-02-04 14:54

I put \"chunk transposition\" in quotes because I don\'t know whether or what the technical term should be. Just knowing if there is a technical term for the process would be ve

6 Answers
  •  刺人心 (OP) · 2021-02-04 15:24

    You might find compression distance useful for this. See an answer I gave for a very similar question.
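    For what it's worth, here is a minimal sketch of normalized compression distance (the usual formalization of compression distance), using Python's zlib as the compressor; the function name and the choice of compressor are mine, not from the linked answer:

        import zlib

        def ncd(x: bytes, y: bytes) -> float:
            # Normalized compression distance: roughly 0 for identical
            # inputs, roughly 1 for unrelated ones. Transposed chunks in
            # the concatenation x + y still compress via back-references,
            # so reordering is penalised far less than by edit distance.
            cx = len(zlib.compress(x))
            cy = len(zlib.compress(y))
            cxy = len(zlib.compress(x + y))
            return (cxy - min(cx, cy)) / max(cx, cy)

        d = ncd(b"one two three four five", b"four five one two three")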

    Or you could use a k-tuple based counting system:

    1. Choose a small value of k, e.g. k=4.
    2. Extract all length-k substrings of your string into a list.
    3. Sort the list. (O(kn log n) time.)
    4. Do the same for the other string you're comparing to. You now have two sorted lists.
    5. Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
    6. The number of k-tuples in common is your similarity score.

    With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
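    A minimal sketch of steps 1-6 above in Python (the function name ktuple_similarity and the sample strings are mine, purely for illustration):

        def ktuple_similarity(a: str, b: str, k: int = 4) -> int:
            # Steps 2-4: extract every length-k substring and sort.
            ta = sorted(a[i:i + k] for i in range(len(a) - k + 1))
            tb = sorted(b[i:i + k] for i in range(len(b) - k + 1))
            # Step 5: merge the two sorted lists in O(n + m) time,
            # pairing up equal k-tuples (duplicates are matched once
            # per occurrence on each side).
            i = j = shared = 0
            while i < len(ta) and j < len(tb):
                if ta[i] == tb[j]:
                    shared += 1
                    i += 1
                    j += 1
                elif ta[i] < tb[j]:
                    i += 1
                else:
                    j += 1
            # Step 6: the count of shared k-tuples is the similarity score.
            return shared

        # Swapping the two words keeps most k-tuples intact, so the score
        # stays high even though the edit distance is large.
        print(ktuple_similarity("hello world", "world hello"))  # 4 of 8 tuples shared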
