Is there an edit distance algorithm that takes “chunk transposition” into account?

被撕碎了的回忆 · 2021-02-04 14:54

I put \"chunk transposition\" in quotes because I don\'t know whether or what the technical term should be. Just knowing if there is a technical term for the process would be ve

6 Answers
  •  刺人心 (OP) · 2021-02-04 15:24

    You might find compression distance useful for this. See an answer I gave for a very similar question.
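    For what it's worth, here is a minimal sketch of normalized compression distance (the usual formalization of compression distance), using Python's zlib as the compressor; the function name and the choice of compressor are mine, not from the linked answer:

        import zlib

        def ncd(x: bytes, y: bytes) -> float:
            # Normalized compression distance: roughly 0 for identical
            # inputs, roughly 1 for unrelated ones. Transposed chunks in
            # the concatenation x + y still compress via back-references,
            # so reordering is penalised far less than by edit distance.
            cx = len(zlib.compress(x))
            cy = len(zlib.compress(y))
            cxy = len(zlib.compress(x + y))
            return (cxy - min(cx, cy)) / max(cx, cy)

        d = ncd(b"one two three four five", b"four five one two three")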

    Or you could use a k-tuple based counting system:

    1. Choose a small value of k, e.g. k=4.
    2. Extract all length-k substrings of your string into a list.
    3. Sort the list. (O(kn log n) time.)
    4. Do the same for the other string you're comparing to. You now have two sorted lists.
    5. Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
    6. The number of k-tuples in common is your similarity score.

    With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
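    A minimal sketch of steps 1-6 above in Python (the function name ktuple_similarity and the sample strings are mine, purely for illustration):

        def ktuple_similarity(a: str, b: str, k: int = 4) -> int:
            # Steps 2-4: extract every length-k substring and sort.
            ta = sorted(a[i:i + k] for i in range(len(a) - k + 1))
            tb = sorted(b[i:i + k] for i in range(len(b) - k + 1))
            # Step 5: merge the two sorted lists in O(n + m) time,
            # pairing up equal k-tuples (duplicates are matched once
            # per occurrence on each side).
            i = j = shared = 0
            while i < len(ta) and j < len(tb):
                if ta[i] == tb[j]:
                    shared += 1
                    i += 1
                    j += 1
                elif ta[i] < tb[j]:
                    i += 1
                else:
                    j += 1
            # Step 6: the count of shared k-tuples is the similarity score.
            return shared

        # Swapping the two words keeps most k-tuples intact, so the score
        # stays high even though the edit distance is large.
        print(ktuple_similarity("hello world", "world hello"))  # 4 of 8 tuples shared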
