Question

I have some Amazon review data that I have successfully converted from text format to CSV. The problem is that when I try to read it into a DataFrame using pandas, I get this error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte

I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?

Thank you!

EDIT1: Here is the code I use to convert the text file to CSV:

import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")

outfile = open(OUTPUT_FILE_NAME,"w")

outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:

   line = line.strip()  
   # need to remove the commas so that the review text won't be split across many columns
   line = line.replace(',','')

   if line == "":
      outfile.write(",".join(currentLine))
      outfile.write("\n")
      currentLine = []
      continue
   parts = line.split(":",1)
   currentLine.append(parts[1])

if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()

EDIT2:

Thanks to all of you for trying to help me out. I have solved it by specifying the output encoding in my code:

 outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
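A minimal sketch of why this change helps, assuming the platform default encoding was a single-byte code page such as cp1252 (the question does not say which one was in use). The sample string is hypothetical; it only illustrates how the same character becomes the byte 0xf8 in one encoding and a valid two-byte sequence in UTF-8.

```python
# Hypothetical reviewer name containing a non-ASCII character.
text = "Søren"

# What a cp1252 default would write to disk (assumption about the platform).
legacy_bytes = text.encode("cp1252")
# What open(..., "w", encoding="utf-8") writes instead.
utf8_bytes = text.encode("utf-8")

print(legacy_bytes)  # b'S\xf8ren'
print(utf8_bytes)    # b'S\xc3\xb8ren'

# Reading the cp1252 bytes back as UTF-8 fails on the lone 0xf8 byte.
try:
    legacy_bytes.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

Writing and reading with the same explicit encoding avoids depending on whatever the platform default happens to be.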

Answer 1:

If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...

You basically have 2 ways to deal with decode errors:

use a charset that will accept any byte, such as iso-8859-15, also known as latin9
if the output should be UTF-8 but the input contains errors, use errors='ignore', which silently drops the bytes that are not valid UTF-8, or errors='replace', which substitutes a replacement marker (U+FFFD when decoding, ? when encoding)

For example:

f = open(INPUT_FILE_NAME,encoding="latin9")

f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
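To make the difference between these options concrete, here is a small sketch that decodes the same bytes three ways. The byte string is a made-up sample, not data from the question; the same error handlers apply when passed to open().

```python
# Hypothetical sample: 0xf8 on its own is not a valid UTF-8 sequence.
raw = b"caf\xf8 review"

# errors="ignore" drops the offending byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

# latin9 maps every byte to some character, so decoding never fails,
# but the result is only meaningful if the file really was latin9.
as_latin9 = raw.decode("latin9")

print(repr(ignored))
print(repr(replaced))
print(repr(as_latin9))
```

Which option fits depends on whether losing the character, marking it, or guessing its original encoding is the least harmful for the data at hand.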

Answer 2:

If you are using Python 3, it has built-in support for Unicode content:

f = open('file.csv', encoding="utf-8")

If you still want to remove all non-ASCII characters from it, you can read the file and strip them out:

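import re

# Matches anything that is not printable ASCII, a tab, or a line break.
# Line breaks are kept so the CSV rows stay separate.
remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e\t\r\n]')

def remove_unicode(string_data):
    """ (str|bytes) -> str

    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data

    if isinstance(string_data, bytes):
        # Drop every byte that is not valid ASCII
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # Round-trip through ASCII to drop non-ASCII characters
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')

    return remove_ctrl_chars_regex.sub('', string_data)

# Read raw bytes so that invalid UTF-8 sequences cannot raise on read
with open('file.csv', 'rb') as csv_file:
    content = remove_unicode(csv_file.read())

# Overwrite the file with the cleaned text; newline='' keeps line endings as read
with open('file.csv', 'w', encoding='ascii', newline='') as csv_file:
    csv_file.write(content)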

Now you can read it without any unicode data issues.
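If the goal is to keep valid non-ASCII characters and only drop the bytes that are not valid UTF-8, writing the result to a separate file as the question asks, one possible sketch is shown below. The function name save_utf8_copy and the sample data are hypothetical and not part of the original answers.

```python
import os
import tempfile


def save_utf8_copy(src_path, dst_path):
    """Copy src_path to dst_path, dropping bytes that are not valid UTF-8."""
    # errors="ignore" silently skips undecodable bytes while reading.
    # newline="" on both files leaves the original line endings untouched.
    with open(src_path, encoding="utf-8", errors="ignore", newline="") as src:
        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


# Small demonstration on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "reviews.csv")
    dst_path = os.path.join(tmp_dir, "reviews_clean.csv")

    # Row 1 contains a stray 0xf8 byte; row 2 contains a valid UTF-8 "é".
    with open(src_path, "wb") as f:
        f.write(b"id,text\n1,caf\xf8\n2,caf\xc3\xa9\n")

    save_utf8_copy(src_path, dst_path)

    with open(dst_path, "rb") as f:
        cleaned = f.read()

print(cleaned)
```

The original file is left untouched, so the cleaned copy can be compared against it before the original is discarded.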

Source: https://stackoverflow.com/questions/32733615/how-to-remove-non-utf-8-code-and-save-as-a-csv-file-python

Tags

python

encoding

utf-8
