I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user
I guess the Levenshtein distance is more useful here than the Hamming distance.
Let's take an example: we take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist.
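To make the enumeration concrete, here is a minimal Python sketch. The function name misspellings_at_distance_1 and the choice of a lowercase a-z alphabet are my own assumptions for illustration, not something the approach prescribes.

```python
import string


def misspellings_at_distance_1(word, alphabet=string.ascii_lowercase):
    """Return every distinct string at Levenshtein distance 1 from `word`.

    Hypothetical helper; the alphabet is an assumption (lowercase a-z).
    """
    results = set()
    for i in range(len(word) + 1):
        # Insertion: put one extra character at position i.
        for ch in alphabet:
            results.add(word[:i] + ch + word[i:])
    for i in range(len(word)):
        # Deletion: drop the character at position i.
        results.add(word[:i] + word[i + 1:])
        # Substitution: replace the character at position i with a different one.
        for ch in alphabet:
            if ch != word[i]:
                results.add(word[:i] + ch + word[i + 1:])
    return results


candidates = misspellings_at_distance_1("example")
print("exaple" in candidates)
print(len(candidates))
```

For the 7-letter word example and a 26-letter alphabet this comes to 383 distinct strings (201 insertions, 7 deletions, 175 substitutions), so the count per word grows with both word length and alphabet size.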
You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.
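Notice how most misspellings occur by doing the same operation with a different character. If a placeholder ? stands for "any character", they collapse into a short list:
Insertion: ?example, e?xample, ex?ample, exa?mple, exam?ple, examp?le, exampl?e, example?
Deletion: xample, eample, exmple, exaple, examle, exampe, exampl
Substitution: ?xample, e?ample, ex?mple, exa?ple, exam?le, examp?e, exampl?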
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
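Generating the hints for a word could look roughly like this in Python. The name generate_hints is hypothetical, and the sketch assumes that words never contain a literal ? character, since ? is used as the "any character" placeholder.

```python
def generate_hints(word, wildcard="?"):
    """Return the set of distance-1 'hints' for `word`.

    Hypothetical helper. Assumes `wildcard` never occurs in real words.
    """
    hints = set()
    for i in range(len(word) + 1):
        # Insertion hint: a placeholder squeezed in at position i.
        hints.add(word[:i] + wildcard + word[i:])
    for i in range(len(word)):
        # Deletion hint: the character at position i removed.
        hints.add(word[:i] + word[i + 1:])
        # Substitution hint: the character at position i replaced by the placeholder.
        hints.add(word[:i] + wildcard + word[i + 1:])
    return hints


stored = generate_hints("example")   # what would be saved for the dictionary word
typed = generate_hints("exaple")     # what would be generated for the user's input
print(sorted(stored & typed))
```

A word of length n produces at most 3n + 1 hints (n + 1 insertions, n deletions, n substitutions), independent of the alphabet size, which is what keeps the table manageable.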
Example: the user enters exaple (notice the missing m).
SELECT DISTINCT word
FROM dictionary
WHERE
    -- insertion hints
    hint = '?exaple'
    OR hint = 'e?xaple'
    OR hint = 'ex?aple'
    OR hint = 'exa?ple'
    OR hint = 'exap?le'
    OR hint = 'exapl?e'
    OR hint = 'exaple?'
    -- deletion hints
    OR hint = 'xaple'
    OR hint = 'eaple'
    OR hint = 'exple'
    OR hint = 'exale'
    OR hint = 'exape'
    OR hint = 'exapl'
    -- substitution hints
    OR hint = '?xaple'
    OR hint = 'e?aple'
    OR hint = 'ex?ple'
    OR hint = 'exa?le'
    OR hint = 'exap?e'
    OR hint = 'exapl?';
exaple with 1 insertion == exa?ple == example with 1 substitution
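Putting the pieces together, here is a small end-to-end sketch using Python's built-in sqlite3 module and an in-memory database. The dictionary(word, hint) table layout matches the query above; the index name, the helper names build_dictionary and suggest, and the three sample words are assumptions made for the demo.

```python
import sqlite3


def generate_hints(word, wildcard="?"):
    """Hypothetical helper: insertion, deletion and substitution hints."""
    hints = set()
    for i in range(len(word) + 1):
        hints.add(word[:i] + wildcard + word[i:])      # insertion
    for i in range(len(word)):
        hints.add(word[:i] + word[i + 1:])             # deletion
        hints.add(word[:i] + wildcard + word[i + 1:])  # substitution
    return hints


def build_dictionary(conn, words):
    """Store one (word, hint) row per hint of every dictionary word."""
    conn.execute("CREATE TABLE dictionary (word TEXT NOT NULL, hint TEXT NOT NULL)")
    # An index on hint keeps the lookup fast as the table grows.
    conn.execute("CREATE INDEX idx_dictionary_hint ON dictionary (hint)")
    rows = [(w, h) for w in words for h in generate_hints(w)]
    conn.executemany("INSERT INTO dictionary (word, hint) VALUES (?, ?)", rows)


def suggest(conn, typed):
    """Return dictionary words that share at least one hint with `typed`."""
    hints = sorted(generate_hints(typed))
    # Bind the hints as parameters; the '?' inside a hint is data, not SQL.
    placeholders = ", ".join("?" for _ in hints)
    sql = (
        "SELECT DISTINCT word FROM dictionary "
        f"WHERE hint IN ({placeholders}) ORDER BY word"
    )
    return [row[0] for row in conn.execute(sql, hints)]


conn = sqlite3.connect(":memory:")
build_dictionary(conn, ["example", "sample", "apple"])
print(suggest(conn, "exaple"))  # ['example']
```

Using IN with bound parameters is equivalent to the chain of OR conditions and avoids pasting user input into the SQL string.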
See also: How does the Google “Did you mean?” Algorithm work?