Creating a “spell check” that checks against a database with a reasonable runtime

Happy的楠姐 2021-01-31 05:22

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user

6 answers

萌比男神i (OP)

2021-01-31 05:47
I guess the Levenshtein distance is more useful here than the Hamming distance.

Let's take an example: we take the word `example` and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate every possible misspelling:
- 1 insertion (208)
  - aexample
  - bexample
  - cexample
  - ...
  - examplex
  - exampley
  - examplez
- 1 deletion (7)
  - xample
  - eample
  - exmple
  - ...
  - exampl
- 1 substitution (182)
  - axample
  - bxample
  - cxample
  - ...
  - examplz
You could store each misspelling in the database and link it to the correct spelling, `example`. That works and would be quite fast, but it creates a huge database: roughly 400 rows for this one seven-letter word.
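
As a rough check of the counts above, here is a minimal Python sketch that enumerates every single-edit operation on `example`. It assumes a lowercase a-z alphabet, and the function name `raw_edits` is made up for this illustration. The raw counts match the figures in the list; the number of distinct strings is somewhat smaller, because some operations produce the same result (inserting an `e` before or after an existing `e`, or "substituting" a letter with itself).

```python
import string


def raw_edits(word, alphabet=string.ascii_lowercase):
    """Enumerate every single-edit operation on `word`, duplicates included."""
    n = len(word)
    # Insert each letter at each of the n + 1 gaps.
    insertions = [word[:i] + c + word[i:] for i in range(n + 1) for c in alphabet]
    # Drop each of the n characters in turn.
    deletions = [word[:i] + word[i + 1:] for i in range(n)]
    # Overwrite each of the n characters with each letter
    # (this includes overwriting a letter with itself).
    substitutions = [word[:i] + c + word[i + 1:] for i in range(n) for c in alphabet]
    return insertions, deletions, substitutions


ins, dele, sub = raw_edits("example")
print(len(ins), len(dele), len(sub))  # 208 7 182

# Distinct misspellings only (the original word itself is excluded).
distinct = (set(ins) | set(dele) | set(sub)) - {"example"}
print(len(distinct))
```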

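Notice how most misspellings come from the same operation at the same position, just with a different character. Replacing that character with a wildcard, `?`, collapses them into a much shorter list: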
- 1 insertion (8)
  - ?example
  - e?xample
  - ex?ample
  - exa?mple
  - exam?ple
  - examp?le
  - exampl?e
  - example?
- 1 deletion (7)
  - xample
  - eample
  - exmple
  - exaple
  - examle
  - exampe
  - exampl
- 1 substitution (7)
  - ?xample
  - e?ample
  - ex?mple
  - exa?ple
  - exam?le
  - examp?e
  - exampl?
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
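
Generating the hints could look something like the sketch below. The function name `hints` and the choice of `?` as the wildcard are assumptions for illustration; any character that cannot occur in a real word would do.

```python
def hints(word, wildcard="?"):
    """Return the set of distance-1 "hints" for `word`.

    The wildcard stands for "any single character", so one hint covers
    all 26 concrete insertions or substitutions at that position.
    """
    result = set()
    n = len(word)
    for i in range(n + 1):
        # Insertion: a wildcard placed in each of the n + 1 gaps.
        result.add(word[:i] + wildcard + word[i:])
    for i in range(n):
        # Deletion: the i-th character removed.
        result.add(word[:i] + word[i + 1:])
        # Substitution: the i-th character replaced by the wildcard.
        result.add(word[:i] + wildcard + word[i + 1:])
    return result


print(len(hints("example")))  # 22 (8 insertions + 7 deletions + 7 substitutions)
print(sorted(hints("example") & hints("exaple")))  # ['exa?ple']
```

The same function is used twice: once per dictionary word when populating the database, and once on the user's input at query time.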

Example: the user enters `exaple` (note the missing `m`).
```
SELECT DISTINCT word
           FROM dictionary
          WHERE hint = '?exaple'
             OR hint = 'e?xaple'
             OR hint = 'ex?aple'
             OR hint = 'exa?ple'
             OR hint = 'exap?le'
             OR hint = 'exapl?e'
             OR hint = 'exaple?'
             OR hint = 'xaple'
             OR hint = 'eaple'
             OR hint = 'exple'
             OR hint = 'exale'
             OR hint = 'exape'
             OR hint = 'exapl'
             OR hint = '?xaple'
             OR hint = 'e?aple'
             OR hint = 'ex?ple'
             OR hint = 'exa?le'
             OR hint = 'exap?e'
             OR hint = 'exapl?'
```
This finds the right word because `exaple` with 1 insertion == `exa?ple` == `example` with 1 substitution.
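
To see the whole round trip, here is a small end-to-end sketch using Python's built-in `sqlite3` module and an in-memory database. The `dictionary(word, hint)` table mirrors the query above; the index name, the helper names `build_dictionary` and `suggest`, and the three sample words are assumptions made for this example. It uses `IN (...)` with bound parameters in place of the chain of `OR` conditions, which is equivalent.

```python
import sqlite3


def hints(word, wildcard="?"):
    """Distance-1 hints: wildcard insertions, deletions, wildcard substitutions."""
    result = set()
    for i in range(len(word) + 1):
        result.add(word[:i] + wildcard + word[i:])
    for i in range(len(word)):
        result.add(word[:i] + word[i + 1:])
        result.add(word[:i] + wildcard + word[i + 1:])
    return result


def build_dictionary(conn, words):
    """Store one (word, hint) row per hint of each word, indexed by hint."""
    conn.execute("CREATE TABLE dictionary (word TEXT NOT NULL, hint TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_dictionary_hint ON dictionary (hint)")
    conn.executemany(
        "INSERT INTO dictionary (word, hint) VALUES (?, ?)",
        [(w, h) for w in words for h in hints(w)],
    )


def suggest(conn, typed):
    """Return dictionary words that share at least one hint with `typed`."""
    candidates = sorted(hints(typed))
    # One bound parameter per hint; the hint text never enters the SQL string.
    placeholders = ", ".join("?" for _ in candidates)
    rows = conn.execute(
        f"SELECT DISTINCT word FROM dictionary WHERE hint IN ({placeholders})",
        candidates,
    )
    return sorted(row[0] for row in rows)


conn = sqlite3.connect(":memory:")
build_dictionary(conn, ["example", "sample", "simple"])
print(suggest(conn, "exaple"))  # ['example']
```

With the index on `hint`, each lookup is a handful of index probes (about three per character of input) regardless of how many words the table holds, at the cost of storing about three rows per character of every dictionary word.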

See also: How does the Google “Did you mean?” Algorithm work?