Using NLTK tokenizer with utf8 [duplicate]

Submitted by 巧了我就是萌 on 2020-01-02 19:30:29

Question


I am a fairly new user of Python and I work mainly with imported text files, especially CSVs, which give me headaches to process. I tried to read docs like this one: https://docs.python.org/2/howto/unicode.html but I can't make sense of what is being said. I just want a straightforward, down-to-earth explanation.

For instance, I want to tokenize a large number of verbatims exported from the internet as a CSV file. I want to use NLTK's tokenizer to do so.

Here's my code:

import csv
import nltk

# unicode_csv_reader: a separately defined helper (presumably the recipe from the Python 2 csv docs)
with open('verbatim.csv', 'r') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        tokens = nltk.word_tokenize(data)

When I print() data, I get clean text.

But when I run the tokenizer on it, it raises the following error:

'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)

It looks like an encoding problem, and it's the same problem with every little manipulation I do with text. Can you help me with this?
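For context, this error typically means that somewhere a byte string containing non-ASCII bytes is being implicitly decoded with Python 2's default ascii codec. A minimal sketch of the failure mode (the string literal is just an illustration):

# Python 2: s is a byte string (str) holding the non-ASCII byte 0xe9 ('é' in Latin-1)
s = 'caf\xe9'

# Mixing bytes with unicode forces an implicit decode with the ascii codec:
u'' + s
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)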


Answer 1:


This should do it:

with open('verbatim.csv') as csvfile:  # no need to pass 'r'; read mode is the default
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        # each row is a list of unicode strings, so tokenize cell by cell
        tokens = [nltk.word_tokenize(cell) for cell in data]

Otherwise, you can also try:

import codecs

with codecs.open('verbatim.csv', encoding='utf-8') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        # again, tokenize each field of the row separately
        tokens = [nltk.word_tokenize(cell) for cell in data]



Answer 2:


First you have to understand that in Python 2, str and unicode are two different types.

There is a lot of documentation and great talks about the subject. I think this is one of the best: https://www.youtube.com/watch?v=sgHbC6udIqc

If you are going to work with text you should really understand the differences.

Overly simplified: str is a sequence of bytes, while unicode is a sequence of "characters" (code points); to get a sequence of bytes, you encode the unicode object with an encoding.
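A minimal round-trip sketch in Python 2 (the example string is arbitrary):

# Python 2
text = u'papá'                # unicode: a sequence of code points
data = text.encode('utf-8')   # str: the byte sequence 'pap\xc3\xa1'
back = data.decode('utf-8')   # decoding the bytes recovers the unicode object
assert back == text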

Yes, it's complicated. My suggestion: watch the video.

I'm not sure what your unicode_csv_reader does, but I'm guessing the problem is there, since nltk works with unicode: somewhere inside it you are probably encoding or decoding with the wrong codec.
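For reference, the unicode_csv_reader recipe from the Python 2 csv module docs looks roughly like this; note that it expects to be fed unicode lines, so feeding it raw byte lines from a plain open() triggers exactly this kind of implicit ascii decode:

import csv

def utf_8_encoder(unicode_csv_data):
    # encode each unicode line to UTF-8 bytes for the byte-oriented csv module
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # parse the UTF-8 bytes, then decode every cell back to unicode
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]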

In [1]: import nltk

In [2]: nltk.word_tokenize(u'mi papá tiene 100 años')
Out[2]: [u'mi', u'pap\xe1', u'tiene', u'100', u'a\xf1os']

I would use the unicodecsv package from PyPI, which returns a list of unicode objects for each line that you can then pass to nltk.

import csv
import nltk
import unicodecsv

with open('verbatim.csv', 'rb') as csvfile:  # unicodecsv reads a binary file object
    reader = unicodecsv.reader(csvfile, dialect=csv.excel, encoding='iso-8859-1')
    for data in reader:
        # each row is a list of unicode strings
        tokens = [nltk.word_tokenize(cell) for cell in data]

You can provide an encoding directly to the reader, and there's no need to use codecs to open the file.



Source: https://stackoverflow.com/questions/36360111/using-nltk-tokenizer-with-utf8
