How do spell checkers work?

Submitted by 跟風遠走 on 2019-11-28 17:05:16

Read up on Tree Traversal. The basic concept is as follows:

  1. Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online. One example is at java.sun.com
  2. Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter), plus a flag to indicate whether or not the current node is the end of a valid word.
  3. Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
  4. If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and testing each resulting candidate against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.
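
A minimal Python sketch of steps 2 to 4 might look like the following. The names (TrieNode, build_trie, contains, substitution_suggestions) are made up for illustration, and a literal word list stands in for the dictionary file from step 1.

```python
class TrieNode:
    """One node of the search tree: child links plus an end-of-word flag."""

    def __init__(self):
        self.children = {}     # letter -> TrieNode (at most 26 for a-z)
        self.is_word = False   # True if a valid word ends at this node


def build_trie(words):
    """Step 2: parse a word list into a search tree."""
    root = TrieNode()
    for word in words:
        node = root
        for letter in word.lower():
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root


def contains(root, word):
    """Step 3: walk the tree one letter at a time."""
    node = root
    for letter in word.lower():
        node = node.children.get(letter)
        if node is None:       # next letter is not a valid child
            return False
    return node.is_word        # end of word reached: is the flag set?


def substitution_suggestions(root, word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Step 4 (naive): replace each character in turn and keep the hits."""
    found = set()
    for i in range(len(word)):
        for letter in alphabet:
            candidate = word[:i] + letter + word[i + 1:]
            if candidate != word and contains(root, candidate):
                found.add(candidate)
    return sorted(found)


# Step 1 would normally read a dictionary file; a literal list stands in here.
trie = build_trie(["apex", "apple", "appoint", "appointed"])
misspelled = [w for w in "apple appint ape".split() if not contains(trie, w)]
print(misspelled)
print(substitution_suggestions(trie, "apez"))
```

Note that substitution alone only finds candidates of the same length as the input, so it cannot recover a word that is missing a letter; that is one reason edit-distance approaches are usually preferred for suggestions.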

A really short example:

Dictionary:

apex apple appoint appointed

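Tree: (* indicates valid end of word)

Update: Thank you to Curt Sampson for pointing out that this data structure is called a Patricia tree. (Strictly speaking, the one-letter-per-node form drawn here is a plain trie, or prefix tree; a Patricia tree is the compressed variant that merges single-child chains.)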

A -> P -> E -> X*
      \-> P -> L -> E*
           \-> O -> I -> N -> T* -> E -> D*

Document:

apple appint ape

Results:

  • "apple" will be found in the tree, so it is considered correct.
  • "appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
  • "ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.

edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of changes that must be made to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.
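
As a rough illustration, here is a Python sketch of the standard dynamic-programming way to compute Levenshtein distance, plus a brute-force ranking of dictionary words by that distance. The function names and the tiny word list are assumptions for the example; a real checker would avoid scanning the whole dictionary for every word.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, limit=3):
    """Return up to `limit` dictionary words, closest first (ties by name)."""
    return sorted(dictionary, key=lambda w: (levenshtein(word, w), w))[:limit]


words = ["apex", "apple", "appoint", "appointed"]
print(suggest("appint", words, limit=2))   # ['appoint', 'apple']
```

Here "appoint" ranks first because a single inserted "o" turns "appint" into it.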

Given that you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell (LGPL licensed). If you really have to implement it yourself, please tell us why.

One should look at prefixes and suffixes.

suddenly = sudden + ly.

By stripping the "ly" suffix, you can get away with storing just the root word.

Likewise preallocate = pre + allocate.

And lovingly = love + ing + ly gets a bit more complex, as the English spelling rules for "ing" get invoked (the silent "e" in "love" is dropped).
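
A naive sketch of this affix-stripping idea in Python, under heavy assumptions: the root list, the single prefix, and the two suffixes are made up for the example, and the only spelling rule handled is the dropped silent "e" before "ing".

```python
ROOTS = {"sudden", "allocate", "love"}   # hypothetical root-word list
PREFIXES = ("pre",)
SUFFIXES = ("ly", "ing")


def candidate_roots(word):
    """Yield possible roots obtained by stripping known affixes."""
    yield word
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            yield from candidate_roots(word[len(prefix):])
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]
            yield from candidate_roots(stem)
            if suffix == "ing":
                # English drops a silent "e" before "ing": love -> loving.
                yield from candidate_roots(stem + "e")


def is_known(word):
    """True if some affix-stripped form of the word is a stored root."""
    return any(root in ROOTS for root in candidate_roots(word.lower()))


for w in ["suddenly", "preallocate", "lovingly", "sudenly"]:
    print(w, is_known(w))
```

This sketch over-accepts: it would also pass invented forms such as "allocately". Real checkers such as Hunspell restrict which affixes may attach to which roots.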

There is also the possibility of using some sort of hashing function to map a root word to a specific bit in a large bitmap, as a constant-time method of determining whether the root word is spelled correctly. (Two different words can hash to the same bit, so a misspelling will occasionally be accepted.)
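
A small Python sketch of the hashed bitmap, assuming a bitmap of about one million bits and SHA-256 as the hash (both arbitrary choices for the example, as are the function names):

```python
import hashlib

BITMAP_BITS = 1 << 20                    # assumed size: about one million bits
bitmap = bytearray(BITMAP_BITS // 8)


def bit_index(word):
    """Map a word to one bit position using a stable hash."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BITMAP_BITS


def add_word(word):
    """Set the bit that corresponds to this word."""
    i = bit_index(word)
    bitmap[i // 8] |= 1 << (i % 8)


def probably_known(word):
    """True if the word's bit is set. Hash collisions can cause false
    positives, but a word that was added is never reported as missing."""
    i = bit_index(word)
    return bool(bitmap[i // 8] & (1 << (i % 8)))


for root in ["sudden", "allocate", "love"]:
    add_word(root)

print(probably_known("sudden"))
```

Using several independent hash functions and requiring all of their bits to be set turns this into a Bloom filter, which lowers the false-positive rate for the same memory.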

You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.
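
For reference, a compact Python sketch of the classic American Soundex coding. Words that sound alike tend to receive the same four-character code, so a misspelling can be matched against dictionary words that share its code. The function name and structure are just one way to write it.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """American Soundex code: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":            # h and w do not separate equal codes
            continue
        code = CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != previous:
            result += code
        previous = code           # a vowel resets, so a repeat is coded again
    return (result + "000")[:4]   # pad with zeros, keep four characters


print(soundex("contry"), soundex("country"))   # C536 C536
```

Because "contry" and "country" share the code C536, a lookup table from Soundex code to dictionary words would offer "country" as a candidate.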

I would advise prototyping with a small set of words. Do a lot of testing, then scale up. It is a wonderful educational problem.

Splitting a word into root and suffix is known as stemming; the "Porter Stemming Algorithm" is a well-known way to do it, and it's a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so that "spell checker" will also find "spelling check" and "spell checking".

I've done this in class

You should consider the Python Natural Language Toolkit (NLTK), which is made specifically to handle this.

It also lets you create text interpreters such as chatbots.

The OpenOffice spell checker, Hunspell, can be a good starting point. Here is the homepage: Hunspell at SourceForge

E James gives a great answer for how to tell if a word is valid. How likely corrections for a misspelling are determined probably depends on the spell checker.

One such method, and the one that I would use, is the Levenshtein distance (a measure of string similarity), which counts how many letters must be added, removed, or substituted in a word in order to turn it into another word.

Say you misspelled "country" as "contry". The Levenshtein distance would be 1, since you only have to add one letter to transform "contry" into "country".

You could then loop through all possible correct spellings of words (there are only about 171,000 English words, and roughly 3,000 of those account for 95% of text), determine those with the lowest Levenshtein distance, and then return the top X words that are most similar to the misspelled word.

There's a great Python package called FuzzyWuzzy which implements this efficiently and generates a percentage similarity between two words or sentences based on this measure.
