“Large” scale spell checking in Python


Question


Surprisingly, I've been unable to find anyone else really doing this, but surely someone has. I'm currently working on a Python project that involves spell checking some 16 thousand words, and unfortunately that number is only going to grow. Right now I'm pulling words from Mongo, iterating through them, and then spell checking them with pyenchant. I've ruled out Mongo as the potential bottleneck by grabbing all my items from there first. That leaves me with around 20 minutes to process 16k words, which is obviously longer than I want to spend. This leaves me with a couple of ideas/questions:

  1. Obviously I could leverage threading or some form of parallelism. Even if I chop this into 4 pieces, I'm still looking at roughly 5 minutes assuming peak performance.

  2. Is there a way to tell what spelling library Enchant is using underneath pyenchant? Enchant's website seems to imply it'll use all available spelling libraries/dictionaries when spell checking. If so, then I'm potentially running each word through three or four spelling dictionaries. This could be my issue right here, but I'm having a hard time proving that's the case. Even if it is, is my option really to uninstall other libraries? That sounds unfortunate. (A way to inspect the active providers is sketched just after this question.)

So, any ideas on how I can squeeze at least a bit more performance out of this? I'm fine with chopping this into parallel tasks, but I'd still like to get the core piece of it to be a bit faster before I do.

Edit: Sorry, posting before morning coffee... Enchant generates a list of suggestions for me if a word is incorrectly spelled. That would appear to be where I spend most of my time in this processing portion.
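Regarding point 2, pyenchant can report which providers and dictionaries its broker sees, and which provider backs a given language tag. A minimal sketch, assuming a standard pyenchant install with the default broker:

import enchant

broker = enchant.Broker()
for provider in broker.describe():             # e.g. aspell, hunspell/myspell, ...
    print(provider.name, '-', provider.desc)

d = enchant.Dict('en_US')
print('en_US is served by:', d.provider.name)  # the single provider behind this Dict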


Answer 1:


I think we agree that the performance bottleneck here is Enchant; for this size of dataset it's nearly instantaneous to do a boolean isSpeltCorrectly. So, why not:

  1. Build an in-memory set of correctly-spelt words, using the dictionaries that Enchant uses or fetching your own (e.g. OpenOffice's).

    Optionally, uniquify the document's words, say by putting them in a set. This probably won't save you very much.

  2. Check whether each word is in the set or not. This is fast, because it's just a set lookup; Python sets are hash tables, so membership testing is O(1) on average.

  3. If it isn't, then ask Enchant to recommend a word for it. This is necessarily slow. (A sketch of the whole approach follows below.)

This assumes that most of your words are spelt correctly; if they aren't, you'll have to be cleverer.
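A minimal sketch of steps 1-3, assuming 'wordlist.txt' is a hypothetical file of correctly-spelt words (one per line) from whichever dictionary source you pick; only the slow suggest() call goes through Enchant:

import enchant

# Step 1: build an in-memory set of correctly-spelt words.
# 'wordlist.txt' is a hypothetical word list; use whatever source you have.
with open('wordlist.txt') as f:
    known = {line.strip().lower() for line in f}

d = enchant.Dict('en_US')

def spell_check(words):
    suggestions = {}
    for word in set(words):                  # optional uniquify from step 1
        if word.lower() in known:            # step 2: fast set membership test
            continue
        suggestions[word] = d.suggest(word)  # step 3: slow path, misses only
    return suggestions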




Answer 2:


I would use a Peter Norvig-style spell checker. I've written a complete post on this:

http://blog.mattalcock.com/2012/12/5/python-spell-checker/

Here's a snippet of the code that generates the possible single edits of the word being checked.

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All (prefix, suffix) splits of the word.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:]               for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:]           for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b               for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

You should be able to iterate through your growing data file of words extremely quickly with this code. See the full post for more information:

http://blog.mattalcock.com/2012/12/5/python-spell-checker/
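For context, here is a hedged sketch of how edits1 is typically combined with a word-frequency model in a Norvig-style corrector; 'big.txt' stands in for whatever corpus of correctly-spelt text you have available:

from collections import Counter

# Hypothetical corpus of correctly-spelt text; substitute your own.
WORDS = Counter(open('big.txt').read().lower().split())

def known(words):
    # Keep only the candidates that actually occur in the corpus.
    return {w for w in words if w in WORDS}

def correct(word):
    # Prefer the word itself, then known words one edit away, then two,
    # and finally fall back to the word unchanged.
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS[w])

Note that edits1 alone generates 54n+25 candidate strings for a word of length n, so the cheap membership tests in known() are what keep the common case fast.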




Answer 3:


Perhaps a better way of doing this would be to compress the document, as this would remove any repeating instances of words, which you actually only need to spell check once. I only suggest this as it would probably perform faster than writing your own unique word finder.

The compressed version should have references to the unique words somewhere within its file; you might have to look up how such files are structured.

You can then spell check all the unique words. I hope you are not checking them with individual SQL queries or something like that; you should load a dictionary, in the form of a tree, into memory and then check words against it (a rough sketch of such a tree follows at the end of this answer).

Once this is done, simply uncompress it and hey presto it's all spell checked. This should be a fairly fast solution.

Or perhaps you don't need to go through the whole zipping process at all, if spell checking really is as fast as the comments suggest; in that case the slowdown would point to a problem in the implementation.
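As a rough illustration of the in-memory dictionary tree mentioned above, here is a minimal sketch using a plain-dict trie; 'wordlist.txt' is again a hypothetical word list with one word per line:

END = '$'  # marker meaning "a complete word ends at this node"

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

with open('wordlist.txt') as f:
    dictionary = build_trie(line.strip().lower() for line in f)

print(in_trie(dictionary, 'hello'))  # True if 'hello' is in the word list

In practice, a plain Python set gives equally fast membership tests with far less code, as in the first answer; the trie only pays off if you also want prefix queries.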



Source: https://stackoverflow.com/questions/3449968/large-scale-spell-checking-in-python
