By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between them.
I'm actually building something similar to Vim's Command-T and ctrlp plugins for Emacs, just for fun. I've just had a productive discussion with some clever workmates about ways to do this most efficiently. The goal is to reduce the number of operations needed to eliminate files that don't match. So we build a nested map, where at the top level each key is a character that appears somewhere in the search set, mapping to the indices of the strings in which that character appears. Each of those indices then maps to a list of character offsets at which that character occurs in that string.
In pseudo code, for the strings "controller", "model" and "view", we'd build a map like this:
{
  "c" => {
    0 => [0]
  },
  "o" => {
    0 => [1, 5],
    1 => [1]
  },
  "n" => {
    0 => [2]
  },
  "t" => {
    0 => [3]
  },
  "r" => {
    0 => [4, 9]
  },
  "l" => {
    0 => [6, 7],
    1 => [4]
  },
  "e" => {
    0 => [8],
    1 => [3],
    2 => [2]
  },
  "m" => {
    1 => [0]
  },
  "d" => {
    1 => [2]
  },
  "v" => {
    2 => [0]
  },
  "i" => {
    2 => [1]
  },
  "w" => {
    2 => [3]
  }
}
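Here's a minimal Python sketch of building that index. The function name build_index and the dict-of-dicts representation are my own choices for illustration, not part of Command-T or ctrlp:

from collections import defaultdict

def build_index(strings):
    # index[char][string_index] -> list of offsets (in increasing order)
    # at which char occurs in that string
    index = defaultdict(lambda: defaultdict(list))
    for i, s in enumerate(strings):
        for offset, ch in enumerate(s):
            index[ch][i].append(offset)
    return index

index = build_index(["controller", "model", "view"])
# index["o"] == {0: [1, 5], 1: [1]}

The offset lists come out sorted for free because we scan each string left to right, which matters for the search step later.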
So now you have a mapping like this:
{
  character-1 => {
    word-index-1 => [occurrence-1, occurrence-2, occurrence-n, ...],
    word-index-n => [ ... ],
    ...
  },
  character-n => {
    ...
  },
  ...
}
Now searching for the string "oe":

1. Look up "o" in the map. Strings 0 and 1 contain it; record the offset of its first occurrence in each: {0 => 1, 1 => 1}
2. Look up "e". For each string still in the running, keep it only if "e" occurs at an offset later than the one we recorded, and record that new offset: {0 => 8, 1 => 3}

Now by looking at the keys in the map that we've accumulated, we know which strings matched the fuzzy search.
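Here's one way that search step could look in Python, building on the index sketch above. The name fuzzy_match and the state parameter are my own; the bisect lookup is just a fast way to find the next occurrence after a given offset:

from bisect import bisect_right

def fuzzy_match(index, query, state=None):
    # state maps string-index -> offset of the most recently matched
    # character in that string. Candidates drop out of it as soon as
    # a query character can't be found at a later offset.
    if state is None:
        # Seed with every string containing the first query character,
        # "matched" before offset 0.
        state = {i: -1 for i in index.get(query[:1], {})}
    for ch in query:
        occurrences = index.get(ch, {})
        next_state = {}
        for i, last in state.items():
            offsets = occurrences.get(i, [])
            # First occurrence of ch strictly after the last matched offset.
            j = bisect_right(offsets, last)
            if j < len(offsets):
                next_state[i] = offsets[j]
        state = next_state
    return state

fuzzy_match(build_index(["controller", "model", "view"]), "oe")
# => {0: 8, 1: 3}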
Ideally, if the search is being performed as the user types, you'll keep track of the accumulated hash of results and pass it back into your search function, so each keystroke only narrows the previous result set rather than starting over. I think this will be a lot faster than iterating over all the strings in the search set and performing a full wildcard match on each one.
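With the hypothetical sketches above, that incremental reuse might look like this: only the newly typed character is processed, against the previously accumulated state:

index = build_index(["controller", "model", "view"])

state = fuzzy_match(index, "o")         # user types "o" -> {0: 1, 1: 1}
state = fuzzy_match(index, "e", state)  # user types "e" -> {0: 8, 1: 3}
matched = sorted(state)                 # [0, 1] -> "controller" and "model"

(If the user deletes a character you'd have to fall back to an earlier saved state, or recompute from scratch.)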
The interesting thing about this is that you could also efficiently store the Levenshtein distance along with each match, assuming you only care about insertions, not substitutions or deletions. Though perhaps it wouldn't be hard to add that logic too.