Are there any classes in NLTK for text normalizing and canonizing?

Submitted by ≡放荡痞女 on 2019-12-31 08:29:37

Question


Most of the NLTK documentation and examples are devoted to lemmatization and stemming, but they are very sparse on normalization matters such as:

  • converting all letters to lower or upper case
  • removing punctuation
  • converting numbers into words
  • removing accent marks and other diacritics
  • expanding abbreviations
  • removing stopwords or "too common" words
  • text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK to dig. Any equivalents of NLTK (in Java or any other language) for the aforementioned purposes are also welcome. Thanks.

UPD: I have written a Python library for text normalization, intended for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.


Answer 1:


Many of these (sub-)tasks can be solved using pure Python methods, without NLTK at all.

a) converting all letters to lower or upper case

text = 'aiUOd'
print(text.lower())
>> 'aiuod'
print(text.upper())
>> 'AIUOD'

b) removing punctuation

text = 'She? Hm, why not!'
puncts = '.,?!'
for sym in puncts:
    text = text.replace(sym, ' ')
print(text)
>> 'She  Hm  why not '
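A more idiomatic variant is `str.translate`, which strips all of `string.punctuation` in a single pass instead of looping with `replace()` (a sketch; here the punctuation is deleted rather than replaced by spaces):

```python
import string

text = 'She? Hm, why not!'
# str.maketrans('', '', chars) builds a table that deletes
# every character in `chars`; translate applies it in one pass
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))
# -> 'She Hm why not'
```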

c) converting numbers into words

Here it would not be that easy to write a few-liner, but plenty of existing solutions turn up if you google it: code snippets, libraries, etc.
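One such existing library is the `num2words` package on PyPI. Purely for illustration, here is a minimal hand-rolled converter for 0-99 (the function and table names below are my own, not from NLTK or any library):

```python
UNITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six',
         'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve',
         'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen',
         'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety']

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ('-' + UNITS[unit] if unit else '')

print(number_to_words(42))   # -> 'forty-two'
print(number_to_words(70))   # -> 'seventy'
```

A production system would also have to handle ordinals, decimals, years, and so on, which is exactly why a dedicated library is the better direction.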

d) removing accent marks and other diacritics

See point b); just build the list with diacritics instead of punctuation.

e) expanding abbreviations

Create a dictionary with abbreviations:

text = 'USA and GB are ...'
abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
for abbrev in abbrevs:
    text = text.replace(abbrev, abbrevs[abbrev])
print(text)
>> 'United States and Great Britain are ...'
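Note that a plain `replace()` also fires inside longer tokens (e.g. 'GB' inside 'RGB'). A sketch of a word-boundary-safe variant with `re` (the sample sentence is my own):

```python
import re

abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
text = 'USA and GB are ... but RGB stays.'

# \b anchors restrict matches to whole tokens, so 'GB' inside
# 'RGB' is left untouched
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, abbrevs)) + r')\b')
expanded = pattern.sub(lambda m: abbrevs[m.group(1)], text)
print(expanded)
# -> 'United States and Great Britain are ... but RGB stays.'
```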

f) removing stopwords or "too common" words

Create a list with stopwords:

text = 'Mary had a little lamb'
temp_corpus = text.split(' ')
stops = ['a', 'the', 'had']
corpus = [token for token in temp_corpus if token not in stops]
print(corpus)
>> ['Mary', 'little', 'lamb']

g) text canonicalization (tumor = tumour, it's = it is)

For tumor -> tumour, use a regex together with a mapping of spelling variants (and a similar mapping for contractions).
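A minimal sketch of that idea; the mapping tables here are illustrative examples of my own and would need extending for a real corpus:

```python
import re

# Illustrative mapping tables -- extend for your own corpus
SPELLING = {'tumour': 'tumor', 'colour': 'color'}
CONTRACTIONS = {"it's": 'it is', "don't": 'do not'}

def canonicalize(text):
    for variant, canon in {**SPELLING, **CONTRACTIONS}.items():
        # \b keeps the rule from firing inside longer words;
        # re.IGNORECASE covers sentence-initial capitals
        text = re.sub(r'\b' + re.escape(variant) + r'\b',
                      canon, text, flags=re.IGNORECASE)
    return text

print(canonicalize("It's a benign tumour"))
# -> 'it is a benign tumor'
```

Note that the replacement lowercases sentence-initial matches; restoring capitalization is left as a calibration step.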

Last but not least, please note that all of the examples above usually need calibration on real texts; I wrote them as a direction to go in.




Answer 2:


I suggest using stopwords.words() for stopword removal. It supports the following languages: Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish.




Answer 3:


I might be a little late, but this may be helpful. Here are the stop words for some languages (English, French, German, Finnish, Hungarian, Turkish, Russian, Czech, Greek, Arabic, Chinese, Japanese, Korean, Catalan, Polish, Hebrew, Norwegian, Swedish, Italian, Portuguese and Spanish): https://pypi.python.org/pypi/many-stop-words



Source: https://stackoverflow.com/questions/9227527/are-there-any-classes-in-nltk-for-text-normalizing-and-canonizing
