Match 2 lists of strings by ressemblance

断了今生、忘了曾经 提交于 2019-12-07 17:33:36

问题


Problem

I have 2 lists of strings. I want to find the best matching pairs from my lists.

For example, I have those 2 lists:

list1 = {"a1","b1","c1"}
list2 = {"a2","b2","c2"}

I want to get the following results:

results = {{"a1,"a2"}, {"b1,"b2"}, {"c1,"c2"}}

Additional Info

To compare 2 strings together, I would like to use something similar to the Levenshtein distance. For example, when I compare "a1" with "a2", it gives me a shorter distance than "a1" with "b2", so "a1"+"a2" would be considered a better match.

I gets complicated when different pairs gets the same distance results. You can't just take minimum distance for a specific item in list1, because another item in list1 could obtain the same distance with the same item in list2.

Question

Do you have suggestions of algorithms for that?

Where I am right now

You better not look at my finding first so you don't get influenced by my work.

I calculate the Levenshtein distance for each possible pair of string and store the results in a 2-dimension array. Then I build a single dimension array where each element has:

  • the pair (the i,j indexes in my 2-dimension array)
  • the distance

Then I sort this array by using distance element.

Finally, I go through the sorted array and resolve the items with a common distance together (all distance==0 first, then all distance==1, etc...). Every time, I resolve an element, I mark it in my 2D array, so I can quickly skip the resolved items in my sorted array.

I think I can better than this solution. It may not the most efficient in time and space.


回答1:


Once you have established the metric you want to use to keep track of the "distance" between two strings, be it the Levenshtein distance or another one, you can use the Hungarian algorithm to solve your problem.

I personally have never implement it, but Wikipedia includes several links that might be of help.




回答2:


My suggestion for a possible optimization to this:

I calculate the Levenshtein distance for each possible pair of string and store the results in a 2-dimension array.

Is that you can avoid computing the distance for every possible pair of the string by considering their lengths. Because let's say:

1. if the pair is e.g. "ab", and "cdefg"
2. and you know that there's another string that has similar length with "ab" e.g. "xy"

Then you shouldn't need to calculate the distance between "ab" and "cdefg". Because the minimum distance you can get between strings of those lengths is 3, whereas the maximum distance between two strings of equal lengths ("ab" and "xy" as in the example) will be 2.

You can do this by using a smarter data structure that keeps track of length of strings e.g. unordered_map<int, vector<string> > in C++0x or tr1 C++.



来源:https://stackoverflow.com/questions/5587127/match-2-lists-of-strings-by-ressemblance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!