Removing non-English words from text using Python

Submitted by 末鹿安然 on 2019-11-29 12:30:45

Question


I am doing a data cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online to find out whether this can be done in Python with a toolkit like NLTK.

For example, given the text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my"

Does anyone know how this could be done? Any help would be much appreciated.


Answer 1:


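You can use the words corpus from NLTK:

import nltk

# nltk.download("words")  # one-time download, needed if the corpus is not installed yet
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
# Keep tokens found in the word list, plus non-alphabetic tokens such as punctuation.
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'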

Unfortunately, "Io" happens to be in the English word list, so it survives the filter. In general, it can be hard to decide whether a word is English or not.
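One way to deal with false positives like "Io" is to wrap the filter in a small function that also accepts a list of words to reject. The following is only a minimal sketch under some assumptions: the function name keep_known_words, its parameters, and the tiny demo_vocabulary are hypothetical and chosen for illustration, and the regular expression is meant to mirror the pattern behind wordpunct_tokenize so the sketch runs without NLTK. Unlike the snippet above, it drops non-alphabetic tokens by default, which matches the output requested in the question.

```python
import re

# Runs of word characters, or runs of punctuation (intended to mirror
# the pattern used by NLTK's wordpunct_tokenize).
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def keep_known_words(text, vocabulary, exclude=(), keep_non_alpha=False):
    """Drop alphabetic tokens that are not in `vocabulary`.

    `vocabulary` is assumed to contain lowercase words. Words listed in
    `exclude` are rejected even if the vocabulary contains them.
    """
    excluded = {w.lower() for w in exclude}
    kept = []
    for token in TOKEN_RE.findall(text):
        if token.isalpha():
            lowered = token.lower()
            if lowered in vocabulary and lowered not in excluded:
                kept.append(token)
        elif keep_non_alpha:
            # Punctuation, numbers and mixed tokens are kept only on request.
            kept.append(token)
    return " ".join(kept)


# Tiny hypothetical vocabulary for illustration; in practice it could be
# built as {w.lower() for w in nltk.corpus.words.words()}.
demo_vocabulary = {"io", "to", "the", "beach", "with", "my"}
sentence = "Io andiamo to the beach with my amico."

print(keep_known_words(sentence, demo_vocabulary))
# Io to the beach with my
print(keep_known_words(sentence, demo_vocabulary, exclude={"io"}))
# to the beach with my
```

The exclusion list has to be maintained by hand, so this only helps when the unwanted words are known in advance. It does not solve the general problem of deciding whether a word is English.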



Source: https://stackoverflow.com/questions/41290028/removing-non-english-words-from-text-using-python
