How to fuzzily search for a dictionary word?

Question

I have read a lot of threads here discussing edit-distance-based fuzzy searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y):

f(x, y) = 1, if characters x and y are similar
        = 0, otherwise

(These "similarities" can be specified by the programmer)

such that, say,

f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1

but,

f('a', 'z') = 0
etc.

Now if we have a query 'cofatyst', the algorithm should report the following matches:

('cot', 0)
('cat', 0)
('catalyst', 0)

where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm. It works great for exact matching, and for cases where each character has relatively few "similar" characters, but its performance degrades exponentially as the number of similar characters per character grows. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take character similarities into account (i.e., not blindly depend on edit distance alone).
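
For reference, the matching rule described above can be pinned down with a brute-force sketch. It is only meant to make the expected output precise, not to be fast: it checks every dictionary word at every offset. It assumes that f is symmetric and that a character is always similar to itself, neither of which the question states explicitly. The names `SIMILAR_PAIRS` and `find_matches` are made up for illustration.

```python
# Similarity pairs taken from the question; treating them as symmetric
# is an assumption.
SIMILAR_PAIRS = {("t", "l"), ("a", "o"), ("f", "t")}


def f(x, y):
    """Return 1 if x and y are identical or listed as similar, else 0."""
    if x == y or (x, y) in SIMILAR_PAIRS or (y, x) in SIMILAR_PAIRS:
        return 1
    return 0


def find_matches(dictionary, query):
    """Return (word, start) for every offset where word fuzzily matches query."""
    matches = []
    for word in dictionary:
        # Slide the word over every window of the same length in the query.
        for start in range(len(query) - len(word) + 1):
            window = query[start:start + len(word)]
            if all(f(w, q) for w, q in zip(word, window)):
                matches.append((word, start))
    return matches


print(find_matches(["cot", "cat", "catalyst"], "cofatyst"))
# [('cot', 0), ('cat', 0), ('catalyst', 0)]
```

The cost is proportional to the total dictionary length times the query length, so this is useful only as a correctness baseline against which a faster approach can be tested.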

One thing to note is that in the wild, the dictionary is going to be really large.

Answer 1:

I might try cosine similarity, using the position of each character as a feature and replacing the product between features with a match function based on your character relations.

Not very specific advice, I know, but I hope it helps.

Edit: expanded answer.

Cosine similarity measures how similar two vectors are. In your case the normalisation might not make sense, so I would do something very simple (I might be oversimplifying the problem). First, treat the CxC matrix (one row and one column per character) as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This also lets you define partial dependencies, to distinguish perfect matches from partial ones. Then, for each position, compute the probability that the letters of the two words do not match (the complement of P(t_i, t_j)), and aggregate the results with a sum.

The result counts the positions at which a given pair of words differ, while still allowing partial dependencies. The implementation is very simple and should scale well, which is why I am not sure whether I have misunderstood your question.
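
A minimal sketch of that position-wise score, under a few assumptions: the two strings have the same length (for example, a dictionary word and an equally long window of the query), the relatedness table is symmetric, and identical characters have probability 1. The function name `mismatch_score` and the partial value for ('a', 'e') are invented for illustration and do not come from the question.

```python
def mismatch_score(word, window, relatedness):
    """Sum, over positions, of the complement of the relatedness probability.

    0.0 means every position matches fully; each unrelated position adds 1.0.
    """
    if len(word) != len(window):
        raise ValueError("word and window must have the same length")
    total = 0.0
    for a, b in zip(word, window):
        if a == b:
            p = 1.0  # assumption: a character is fully related to itself
        else:
            # Assumption: the table is symmetric; pairs not listed are unrelated.
            p = relatedness.get((a, b), relatedness.get((b, a), 0.0))
        total += 1.0 - p
    return total


relatedness = {
    ("t", "l"): 1.0,
    ("a", "o"): 1.0,
    ("f", "t"): 1.0,
    ("a", "e"): 0.5,  # hypothetical partial dependency
}

print(mismatch_score("cat", "cof", relatedness))  # 0.0
print(mismatch_score("cat", "cet", relatedness))  # 0.5
print(mismatch_score("cat", "czt", relatedness))  # 1.0
```

A threshold on this score would then decide what counts as a match; a threshold of 0 reproduces the all-or-nothing behaviour asked for in the question.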

Answer 2:

I am using the Fuse JavaScript library for a project of mine. It is a single JavaScript file that works on a JSON dataset, and it is quite fast. Have a look at it.
According to its site, it implements a full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool.

The code is simple enough that the algorithm's implementation is easy to follow.
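
To show how Bitap could accommodate the character similarities from the question, here is a rough sketch of the exact-match (shift-and) form of the algorithm, in which each text character's bitmask marks every pattern position holding a similar character. This is not Fuse's code or API, it is written in Python rather than JavaScript, and it handles only substitutions between similar characters, not insertions or deletions. The symmetric similarity relation and the names `similar` and `shift_and_find` are assumptions made for illustration.

```python
def similar(a, b, pairs):
    """True if a and b are identical or listed as similar (assumed symmetric)."""
    return a == b or (a, b) in pairs or (b, a) in pairs


def shift_and_find(pattern, text, pairs):
    """Return start indices where pattern matches text up to similar characters."""
    m = len(pattern)
    if m == 0:
        return []
    masks = {}  # text character -> bitmask of pattern positions it can match

    def mask_for(ch):
        if ch not in masks:
            bits = 0
            for i, p in enumerate(pattern):
                if similar(p, ch, pairs):
                    bits |= 1 << i
            masks[ch] = bits
        return masks[ch]

    accept = 1 << (m - 1)  # bit set when the whole pattern has been matched
    state = 0              # bit i set: pattern[0..i] matches text ending here
    starts = []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & mask_for(ch)
        if state & accept:
            starts.append(j - m + 1)
    return starts


pairs = {("t", "l"), ("a", "o"), ("f", "t")}
for word in ["cot", "cat", "catalyst"]:
    print(word, shift_and_find(word, "cofatyst", pairs))
# cot [0]
# cat [0]
# catalyst [0]
```

Because similar characters are folded into the masks, the per-character work does not grow with the number of similar characters, although this sketch still scans the query once per dictionary word.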

Source: https://stackoverflow.com/questions/16333766/how-to-fuzzily-search-for-a-dictionary-word

Tags

algorithm

nlp

search-engine