Algorithms for “fuzzy matching” strings

傲寒 2020-12-12 13:23

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those that contain the characters of the search string in order, possibly with other characters in between.

6 Answers
  •  北海茫月
    2020-12-12 13:58

    I'm actually building something similar to Vim's Command-T and ctrlp plugins for Emacs, just for fun. I've just had a productive discussion with some clever workmates about ways to do this most efficiently. The goal is to reduce the number of operations needed to eliminate files that don't match. So we create a nested map, where at the top level each key is a character that appears somewhere in the search set, mapping to the indices of the strings that contain that character. Each of those indices then maps to a list of character offsets at which that character appears in that string.

    In pseudo code, for the strings:

    • controller
    • model
    • view

    We'd build a map like this:

    {
      "c" => {
               0 => [0]
             },
      "o" => {
               0 => [1, 5],
               1 => [1]
             },
      "n" => {
               0 => [2]
             },
      "t" => {
               0 => [3]
             },
      "r" => {
               0 => [4, 9]
             },
      "l" => {
               0 => [6, 7],
               1 => [4]
             },
      "e" => {
               0 => [8],
               1 => [3],
               2 => [2]
             },
      "m" => {
               1 => [0]
             },
      "d" => {
               1 => [2]
             },
      "v" => {
               2 => [0]
             },
      "i" => {
               2 => [1]
             },
      "w" => {
               2 => [3]
             }
    }
    

    So now you have a mapping like this:

    {
      character-1 => {
        word-index-1 => [occurrence-1, occurrence-2, occurrence-n, ...],
        word-index-n => [ ... ],
        ...
      },
      character-n => {
        ...
      },
      ...
    }
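
    Here's a minimal sketch in Python of building that index (the name build_index is mine, not something from Command-T or ctrlp):

    from collections import defaultdict

    def build_index(strings):
        """Build the nested map: {char: {string_index: [offsets]}}."""
        index = defaultdict(lambda: defaultdict(list))
        for string_index, s in enumerate(strings):
            for offset, char in enumerate(s):
                index[char][string_index].append(offset)
        return index

    index = build_index(["controller", "model", "view"])
    # index["e"] == {0: [8], 1: [3], 2: [2]}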
    

    Now searching for the string "oe":

    1. Initialize a new map where the keys are the indices of strings that still match, and the values are the offset of the last matched character in each string.
    2. Consume the first char from the search string, "o", and look it up in the lookup table.
    3. Since strings at indices 0 and 1 contain an "o", put them into the map: {0 => 1, 1 => 1}.
    4. Now consume the next char in the search string, "e", and look it up in the table.
    5. Here 3 strings match, but we know that we only care about strings 0 and 1.
    6. For each surviving string, check whether "e" occurs at an offset greater than that string's current offset. If not, eliminate it from our map; otherwise, update the offset to the earliest such occurrence: {0 => 8, 1 => 3}.

    Now, by looking at the keys of the map we've accumulated, we know which strings matched the fuzzy search.
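
    In Python, that walkthrough might look like the sketch below, built on the index from above (fuzzy_search is my own name for it; the greedy earliest-occurrence strategy is enough to decide whether a string matches, though a real plugin would also rank the matches):

    from bisect import bisect_right

    def fuzzy_search(index, query, candidates=None):
        """Return {string_index: offset_of_last_matched_char} for every
        string that the query matches as an in-order subsequence.

        `candidates` is the map accumulated from an earlier, shorter query,
        so a longer query only narrows the existing result set."""
        for char in query:
            occurrences = index.get(char, {})
            if candidates is None:
                # First character: every string containing it becomes a
                # candidate, matched at its earliest occurrence.
                candidates = {i: offsets[0] for i, offsets in occurrences.items()}
            else:
                survivors = {}
                for i, offset in candidates.items():
                    offsets = occurrences.get(i)
                    if offsets:
                        # Earliest occurrence strictly after the current offset.
                        j = bisect_right(offsets, offset)
                        if j < len(offsets):
                            survivors[i] = offsets[j]
                candidates = survivors
        return candidates if candidates is not None else {}

    matches = fuzzy_search(index, "oe")
    # matches == {0: 8, 1: 3} -> "controller" and "model" match, "view" doesn't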

    Ideally, if the search is being performed as the user types, you'll keep track of the accumulated hash of results and pass it back into your search function. I think this will be a lot faster than iterating over all the strings in the search set and performing a full wildcard match on each one.
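
    With the sketch above, that incremental narrowing might look like this (the API shape here is my assumption):

    candidates = None
    for typed_char in "oe":          # the user types "o", then "e"
        candidates = fuzzy_search(index, typed_char, candidates)
    # candidates == {0: 8, 1: 3}, identical to searching "oe" in one shot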

    The interesting thing about this is that you could also efficiently store the Levenshtein distance along with each match, assuming you only care about insertions, not substitutions or deletions. Though perhaps it isn't hard to add that logic too.
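
    As a side note (my own observation, not part of the original answer): once the query matches a string as a subsequence, an insertion-only edit distance from the query to that string is just the length difference, so it comes essentially for free:

    def insertion_distance(strings, query, matches):
        # Turning the query into a matched string using only insertions
        # requires exactly len(string) - len(query) insertions.
        return {i: len(strings[i]) - len(query) for i in matches}

    # insertion_distance(["controller", "model", "view"], "oe", matches)
    # -> {0: 8, 1: 3}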
