Are there any classes in NLTK for text normalizing and canonizing?

Submitted by ≡放荡痞女 on 2019-12-31 08:29:37

Question


Most of the NLTK documentation and examples are devoted to lemmatization and stemming, but they are very sparse on normalization matters such as:

  • converting all letters to lower or upper case
  • removing punctuation
  • converting numbers into words
  • removing accent marks and other diacritics
  • expanding abbreviations
  • removing stopwords or "too common" words
  • text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK to dig. Any equivalents of NLTK (in Java or any other language) for the aforementioned purposes are also welcome. Thanks.

UPD: I have written a Python library for text normalization, intended for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.


Answer 1:


Many of these (sub-)tasks can be solved using pure Python methods, without NLTK at all.

a) converting all letters to lower or upper case

text = 'aiUOd'
print(text.lower())
>> 'aiuod'
print(text.upper())
>> 'AIUOD'

b) removing punctuation

text = 'She? Hm, why not!'
puncts = '.,?!'
for sym in puncts:
    text = text.replace(sym, ' ')
print(text)
>> 'She  Hm  why not '
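A more idiomatic variant is `str.translate`, which strips all of `string.punctuation` in a single pass instead of looping with `replace()` (a sketch; here the punctuation is deleted rather than replaced by spaces):

```python
import string

text = 'She? Hm, why not!'
# str.maketrans('', '', chars) builds a table that deletes
# every character in `chars`; translate applies it in one pass
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))
# -> 'She Hm why not'
```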

c) converting numbers into words

Here it would not be that easy to write a few-liner, but plenty of existing solutions turn up if you google it: code snippets, libraries, etc.
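One such existing library is the `num2words` package on PyPI. Purely for illustration, here is a minimal hand-rolled converter for 0-99 (the function and table names below are my own, not from NLTK or any library):

```python
UNITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six',
         'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve',
         'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen',
         'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety']

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ('-' + UNITS[unit] if unit else '')

print(number_to_words(42))   # -> 'forty-two'
print(number_to_words(70))   # -> 'seventy'
```

A production system would also have to handle ordinals, decimals, years, and so on, which is exactly why a dedicated library is the better direction.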

d) removing accent marks and other diacritics

See point b); just build the list with diacritics instead of punctuation.

e) expanding abbreviations

Create a dictionary with abbreviations:

text = 'USA and GB are ...'
abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
for abbrev in abbrevs:
    text = text.replace(abbrev, abbrevs[abbrev])
print(text)
>> 'United States and Great Britain are ...'
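Note that a plain `replace()` also fires inside longer tokens (e.g. 'GB' inside 'RGB'). A sketch of a word-boundary-safe variant with `re` (the sample sentence is my own):

```python
import re

abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
text = 'USA and GB are ... but RGB stays.'

# \b anchors restrict matches to whole tokens, so 'GB' inside
# 'RGB' is left untouched
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, abbrevs)) + r')\b')
expanded = pattern.sub(lambda m: abbrevs[m.group(1)], text)
print(expanded)
# -> 'United States and Great Britain are ... but RGB stays.'
```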

f) removing stopwords or "too common" words

Create a list with stopwords:

text = 'Mary had a little lamb'
temp_corpus = text.split(' ')
stops = ['a', 'the', 'had']
corpus = [token for token in temp_corpus if token not in stops]
print(corpus)
>> ['Mary', 'little', 'lamb']

g) text canonicalization (tumor = tumour, it's = it is)

For tumor -> tumour, use a regex together with a mapping of spelling variants (and a similar mapping for contractions).
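A minimal sketch of that idea; the mapping tables here are illustrative examples of my own and would need extending for a real corpus:

```python
import re

# Illustrative mapping tables -- extend for your own corpus
SPELLING = {'tumour': 'tumor', 'colour': 'color'}
CONTRACTIONS = {"it's": 'it is', "don't": 'do not'}

def canonicalize(text):
    for variant, canon in {**SPELLING, **CONTRACTIONS}.items():
        # \b keeps the rule from firing inside longer words;
        # re.IGNORECASE covers sentence-initial capitals
        text = re.sub(r'\b' + re.escape(variant) + r'\b',
                      canon, text, flags=re.IGNORECASE)
    return text

print(canonicalize("It's a benign tumour"))
# -> 'it is a benign tumor'
```

Note that the replacement lowercases sentence-initial matches; restoring capitalization is left as a calibration step.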

Last but not least, please note that all of the examples above usually need calibration on real texts; I wrote them as a direction to go in.




Answer 2:


I suggest using stopwords.words() for stopword removal. It supports the following languages: Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish.




Answer 3:


I might be a little late, but this may be helpful. Here are the stop words for some languages (English, French, German, Finnish, Hungarian, Turkish, Russian, Czech, Greek, Arabic, Chinese, Japanese, Korean, Catalan, Polish, Hebrew, Norwegian, Swedish, Italian, Portuguese and Spanish): https://pypi.python.org/pypi/many-stop-words



Source: https://stackoverflow.com/questions/9227527/are-there-any-classes-in-nltk-for-text-normalizing-and-canonizing
