Removing non-english words from a sentence in python

纵饮孤独 提交于 2019-12-10 23:49:49

问题


I have written a code which sends queries to Google and returns the results. I extract the snippets(summaries) from these results for further processing. However, sometime non-english words are in these snippets which I don't want them. for example:

/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/ 

I only want the "unstressed" word in this sentence. How can I do that? thanks


回答1:


PyEnchant might be a simple option for you. I do not know about its speed, but you can do things like:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

A tutorial is found here, it also has options to return suggestions which you can you again for another query or something. In addition you can check if your result is in latin-1 (is_utf8() excists, do not know if is_latin-1() does also, maybe use something like Enca which detects the encoding of text files, on the basis of knowledge of their language.)




回答2:


You can compare the words you receive with a dictionary of english words, for example /usr/share/dict/words on a BSD system.

I would guess that googles results for the most part is grammatically correct, but if not, you might have to look into stemming in order to match against your dictionary.




回答3:


You can use PyWordNet. That is a python interface for the WordNet. Just split your sentence on white spaces and check for each word is it in the dictionary.



来源:https://stackoverflow.com/questions/4031556/removing-non-english-words-from-a-sentence-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!