Question
I am a fairly new user of Python and I work mainly with imported text files, especially CSVs, which give me headaches to process. I tried to read docs like this one: https://docs.python.org/2/howto/unicode.html but I can't make head or tail of what is being said. I just want a straightforward, down-to-earth explanation.
For instance, I want to tokenize a large number of verbatims exported from the web as a CSV file, using NLTK's tokenizer.
Here's my code:
import csv
import nltk

# unicode_csv_reader is a helper defined elsewhere (not shown here)
with open('verbatim.csv', 'r') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        tokens = nltk.word_tokenize(data)
When I do a print() on data I get clean text.
But when I use the tokenizer, it raises the following error:
'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)
It looks like an encoding problem, and it's the same problem with every little manipulation I do with text. Can you help me with this?
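For reference, in Python 2 this UnicodeDecodeError typically comes from an implicit ASCII decode, which happens whenever byte strings and unicode strings get mixed. A minimal reproduction with a made-up byte string:

>>> u'menu: ' + 'caf\xe9'  # byte 0xe9 is 'é' in Latin-1; mixing str and
...                        # unicode forces an implicit ascii decode
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)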
Answer 1:
This should do it:
import csv
import nltk

with open('verbatim.csv') as csvfile:  # no need to set mode to 'r'; it's the default
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        # each row is a list of fields; decode each field before tokenizing
        tokens = [nltk.word_tokenize(unicode(field, 'utf-8')) for field in data]
Otherwise, you can also try:
import codecs
import csv
import nltk

with codecs.open('verbatim.csv', encoding='utf-8') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        # again, each row is a list of fields, so tokenize them one by one
        tokens = [nltk.word_tokenize(field) for field in data]
Answer 2:
First, you have to understand that str and unicode are two different types in Python 2.
There is a lot of documentation and great talks about the subject. I think this is one of the best: https://www.youtube.com/watch?v=sgHbC6udIqc
If you are going to work with text you should really understand the differences.
Overly simplified: str is a sequence of bytes; unicode is a sequence of "characters" (code points). To get a sequence of bytes, you encode the unicode object with an encoding.
Yes, it's complicated. My suggestion: watch the video.
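As a quick, minimal illustration in Python 2 (the literal below is just example data):

# the same text as bytes (str) and as code points (unicode)
s = 'caf\xc3\xa9'             # str: the UTF-8 bytes for u'café'
u = s.decode('utf-8')         # unicode: u'caf\xe9', four code points
print type(s), len(s)         # <type 'str'> 5
print type(u), len(u)         # <type 'unicode'> 4
print u.encode('utf-8') == s  # True: encode() turns code points back into bytes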
I'm not sure what your unicode_csv_reader does, but I'm guessing the problem is there, since nltk works with unicode. My guess is that inside unicode_csv_reader you are trying to encode/decode something with the wrong codec.
In [1]: import nltk
In [2]: nltk.word_tokenize(u'mi papá tiene 100 años')
Out[2]: [u'mi', u'pap\xe1', u'tiene', u'100', u'a\xf1os']
I would use the unicodecsv package from PyPI, which returns a list of unicode objects for each line that you can pass to nltk:
import csv
import nltk
import unicodecsv

with open('verbatim.csv', 'rb') as csvfile:  # csv readers want binary mode on Python 2
    reader = unicodecsv.reader(csvfile, dialect=csv.excel, encoding='iso-8859-1')
    for data in reader:
        # data is a list of unicode fields; tokenize field by field
        tokens = [nltk.word_tokenize(field) for field in data]
You can provide an encoding to the reader, and there's no need to use codecs to open the file.
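A quick sanity check with made-up in-memory data (the byte string is UTF-8 here purely for the demo):

import io
import unicodecsv

buf = io.BytesIO(b'caf\xc3\xa9,100\n')  # the UTF-8 bytes for u'café,100'
for row in unicodecsv.reader(buf, encoding='utf-8'):
    print row  # [u'caf\xe9', u'100'] -- unicode cells, ready for nltk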
Source: https://stackoverflow.com/questions/36360111/using-nltk-tokenizer-with-utf8