Creating a “spell check” that checks against a database with a reasonable runtime

Happy的楠姐 2021-01-31 05:22

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user

6 Answers
  •  萌比男神i
    2021-01-31 05:47

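    I guess the Levenshtein distance is more useful here than the Hamming distance: the Hamming distance only compares strings of equal length, so it cannot account for an inserted or missing letter.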

    Let's take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all the possible misspellings:

    • 1 insertion (208)
      • aexample
      • bexample
      • cexample
      • ...
      • examplex
      • exampley
      • examplez
    • 1 deletion (7)
      • xample
      • eample
      • exmple
      • ...
      • exampl
    • 1 substitution (182)
      • axample
      • bxample
      • cxample
      • ...
      • examplz
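
    The enumeration above can be reproduced with a few lines of Python. This is only a sketch, assuming a lowercase a-z alphabet; the helper name edits1 is made up for illustration. The raw counts match the figures in the list, although a few of the generated strings coincide (inserting e next to an existing e, or substituting a letter with itself), so the number of distinct misspellings is slightly lower.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return the raw lists of strings one edit away from `word`.

    Duplicates are kept on purpose so the list lengths match the
    position-times-alphabet counts.
    """
    insertions = [word[:i] + c + word[i:]
                  for i in range(len(word) + 1) for c in alphabet]
    deletions = [word[:i] + word[i + 1:] for i in range(len(word))]
    substitutions = [word[:i] + c + word[i + 1:]
                     for i in range(len(word)) for c in alphabet]
    return insertions, deletions, substitutions


insertions, deletions, substitutions = edits1("example")
print(len(insertions), len(deletions), len(substitutions))  # 208 7 182

# Distinct misspellings, excluding the correctly spelled word itself.
distinct = (set(insertions) | set(deletions) | set(substitutions)) - {"example"}
print(len(distinct))
```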

    You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.

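    Notice how most misspellings come from doing the same operation at the same position with a different character. If you replace that character with a wildcard (?), they collapse into far fewer patterns: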

    • 1 insertion (8)
      • ?example
      • e?xample
      • ex?ample
      • exa?mple
      • exam?ple
      • examp?le
      • exampl?e
      • example?
    • 1 deletion (7)
      • xample
      • eample
      • exmple
      • exaple
      • examle
      • exampe
      • exampl
    • 1 substitution (7)
      • ?xample
      • e?ample
      • ex?mple
      • exa?ple
      • exam?le
      • examp?e
      • exampl?

    That looks quite manageable. You could generate all these "hints" for each word and store them in the database, each linked to its word. When the user enters a word, generate all the "hints" for that input in the same way and query the database for them.
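
    Generating the hints is a short loop over the positions of the word. A minimal sketch, assuming ? never occurs in a real word; the function name hints is hypothetical:

```python
def hints(word, wildcard="?"):
    """Return the set of wildcard hints one edit away from `word`."""
    result = set()
    for i in range(len(word) + 1):
        # 1 insertion: a wildcard placed before position i (or at the end)
        result.add(word[:i] + wildcard + word[i:])
    for i in range(len(word)):
        # 1 deletion: drop the character at position i
        result.add(word[:i] + word[i + 1:])
        # 1 substitution: replace the character at position i
        result.add(word[:i] + wildcard + word[i + 1:])
    return result


print(len(hints("example")))  # 22
print(sorted(hints("example") & hints("exaple")))  # ['exa?ple']
```

    A word of n letters produces at most 3n + 1 hints, so the table grows linearly with word length instead of with the alphabet size.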

    Example: the user enters exaple (note the missing m).

    SELECT DISTINCT word
      FROM dictionary
     WHERE hint IN (
             '?exaple', 'e?xaple', 'ex?aple', 'exa?ple', 'exap?le', 'exapl?e', 'exaple?',  -- 1 insertion
             'xaple', 'eaple', 'exple', 'exale', 'exape', 'exapl',                         -- 1 deletion
             '?xaple', 'e?aple', 'ex?ple', 'exa?le', 'exap?e', 'exapl?'                    -- 1 substitution
           );

    This finds example, because both words share the hint exa?ple:

    exaple with 1 insertion == exa?ple == example with 1 substitution
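
    Putting the pieces together, here is a small end-to-end sketch using Python's built-in sqlite3 module and an in-memory database. The table and column names (dictionary, word, hint) follow the query above; the index name, the sample words, and the helper functions are assumptions made for illustration. The index on hint is what keeps the lookup fast on a large table.

```python
import sqlite3


def hints(word, wildcard="?"):
    """Return the set of wildcard hints one edit away from `word`."""
    result = set()
    for i in range(len(word) + 1):
        result.add(word[:i] + wildcard + word[i:])      # 1 insertion
    for i in range(len(word)):
        result.add(word[:i] + word[i + 1:])             # 1 deletion
        result.add(word[:i] + wildcard + word[i + 1:])  # 1 substitution
    return result


def build_index(conn, words):
    """Store every hint of every word, then index the hint column."""
    conn.execute("CREATE TABLE dictionary (word TEXT NOT NULL, hint TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO dictionary (word, hint) VALUES (?, ?)",
        [(w, h) for w in words for h in sorted(hints(w))],
    )
    conn.execute("CREATE INDEX idx_dictionary_hint ON dictionary (hint)")
    conn.commit()


def suggest(conn, typed):
    """Return dictionary words that share at least one hint with `typed`."""
    candidates = sorted(hints(typed))
    # One bound parameter per hint; the "?" inside a hint is plain data.
    placeholders = ", ".join("?" for _ in candidates)
    rows = conn.execute(
        "SELECT DISTINCT word FROM dictionary "
        f"WHERE hint IN ({placeholders}) ORDER BY word",
        candidates,
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, ["example", "sample", "exile"])
print(suggest(conn, "exaple"))  # ['example']
```

    A correctly spelled input is found as well, since it shares all of its hints with the stored word. Two words can also share a hint while being two edits apart (for instance when both reach the same string by a deletion), so you may want to rank or filter the returned candidates afterwards.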

    See also: How does the Google “Did you mean?” Algorithm work?
