Question:
Having some problems. I'm doing Twitter sentiment analysis on a dataset of 1.6 million tweets. Since my PC could not handle the work (too many computations), my professor told me to use the university server.
I just realized that the server runs Python 2.7, which does not let me pass an encoding parameter to the csv reader when reading the file.
Every time I get a UnicodeDecodeError, I have to manually remove the offending tweet from the dataset, otherwise I can't do the rest. I have tried all the related questions on the site but resolved nothing.
I just want to skip the lines that raise the error, since the dataset is big enough to still allow a good analysis.
import codecs
import csv

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8", errors='ignore')
class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8", errors='replace') for s in row]

    def __iter__(self):
        return self
def extraction(file, textCol, sentimentCol):
    "The function reads the tweets"
    #fp = open(file, "r", encoding="utf8")
    fp = open(file, "r")
    tweetreader = UnicodeReader(fp)
    #tweetreader = csv.reader(fp, delimiter=',', quotechar='"', escapechar='\\')
    tweets = []
    for row in tweetreader:
        # Take the columns that hold the tweet text and the sentiment
        if row[sentimentCol] == 'positive' or row[sentimentCol] == '4':
            tweets.append([remove_stopwords(row[textCol]), 'positive'])
        elif row[sentimentCol] == 'negative' or row[sentimentCol] == '0':
            tweets.append([remove_stopwords(row[textCol]), 'negative'])
        elif row[sentimentCol] == 'irrilevant' or row[sentimentCol] == '2' or row[sentimentCol] == 'neutral':
            tweets.append([remove_stopwords(row[textCol]), 'neutral'])
    tweets = filterWords(tweets)
    fp.close()
    return tweets
Error:
Traceback (most recent call last):
File "sentimentAnalysis_v4.py", line 165, in <module>
newTweets = extraction("sentiment2.csv",5,0)
File "sentimentAnalysis_v4.py", line 47, in extraction
for row in tweetreader:
File "sentimentAnalysis_v4.py", line 29, in next
row = self.reader.next()
File "sentimentAnalysis_v4.py", line 19, in next
return self.reader.next().encode("utf-8", errors='ignore')
File "/usr/lib/python2.7/codecs.py", line 615, in next
line = self.readline()
File "/usr/lib/python2.7/codecs.py", line 530, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.7/codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd9 in position 48: invalid continuation byte
Answer 1:
If you have input data that is malformed, I'd not use codecs here to do the reading. Use the newer io.open() function and specify an error handling strategy; 'replace' should do:
import io

class ForgivingUTF8Recoder:
    def __init__(self, filename, encoding):
        self.reader = io.open(filename, newline='', encoding=encoding, errors='replace')

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8", errors='ignore')
I set the newline handling to '' to make sure the CSV module gets to handle newlines in values correctly.
Instead of passing in an open file, just pass in the filename:
tweetreader = UnicodeReader(file)
This won't let you skip faulty lines; instead it handles faulty lines by replacing characters that cannot be decoded with the U+FFFD REPLACEMENT CHARACTER. You can still look for that character in your columns if you want to skip the whole row.
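For instance (a minimal sketch with made-up data; the byte string below is purely illustrative), the replacement character can be used as a filter to drop the rows that contained undecodable bytes:

    # Decoding with errors='replace' turns invalid bytes into U+FFFD,
    # which we can then use to identify and drop the damaged lines.
    raw = b'good tweet,4\nbad \xd9 tweet,0\n'  # 0xd9 starts an invalid UTF-8 sequence here
    lines = raw.decode('utf-8', errors='replace').splitlines()
    clean = [line for line in lines if u'\ufffd' not in line]
    print(clean)  # only the well-formed line survives

This sketch runs under both Python 2.7 and Python 3; in your code the same `u'\ufffd' not in ...` check would go inside the extraction loop, applied to each decoded row.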
Source: https://stackoverflow.com/questions/28494869/unicodedecodeerror-on-python-2-7