Question
Most questions around reading strings from disk in Python involve codec issues. In contrast, I have a CSV file that just flat out has garbage data in it. Here's how to create an example:
b = bytearray(b'a,b,c\n1,2,qwe\n10,-20,asdf')
b[10] = 0xff
b[11] = 0xff
with open('foo.csv', 'wb') as fid:
    fid.write(b)
Note that the second row, third column now contains two 0xFF bytes, which are not valid UTF-8 (or any sensible encoding); it's just a small amount of garbage data.
When I try to read this with pandas.read_csv:
import pandas as pd
df = pd.read_csv('foo.csv') # fails
I get an error, naturally:
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
...
File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I can however successfully read this file if I use Pandas' Python CSV engine:
df2 = pd.read_csv('foo.csv', engine='python') # success
In this case, the invalid bytes are replaced with U+FFFD, the character Unicode reserves for replacing undecodable data.
Question: is there any way for Pandas' C CSV engine to do the same thing as Python's here?
Answer 1:
The replacement of invalid characters you see with the Python engine corresponds to the errors='replace' mode when decoding a bytes-like object.
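For reference, here is what that error mode does at the bytes level, using the garbage field from the example file (after overwriting bytes 10 and 11, the third field of the second row holds b'\xff\xffe'):
# each undecodable byte becomes one U+FFFD replacement character
b'\xff\xffe'.decode('utf-8', errors='replace')   # -> '\ufffd\ufffde'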
You may read the CSV using an arbitrary single-byte encoding and transcode the affected columns with this error mode (by passing a converter to read_csv or using the series.str.encode/decode methods, as in the sketch below), but it's quite cumbersome since you have to identify the specific set of columns.
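A minimal sketch of that per-column route, assuming the affected column is c (as it is in the example file): read the file with latin-1, which maps every byte to a code point and therefore cannot fail, then round-trip the column through UTF-8 with errors='replace'.
import pandas as pd

# latin-1 accepts any byte, so the read itself succeeds even on garbage data
df = pd.read_csv('foo.csv', encoding='latin-1')
# recover the raw bytes of the affected column, then re-decode as UTF-8,
# replacing anything invalid with U+FFFD
df['c'] = df['c'].str.encode('latin-1').str.decode('utf-8', errors='replace')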
For a global effect, since read_csv does not (yet) support an errors parameter, you can pre-open the file with the Python built-in open, which does support it:
df = pd.read_csv(open('foo.csv', errors='replace'))
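Because read_csv then receives an already-decoded text stream, the C engine never has to decode the raw bytes itself, and the garbage simply shows up as U+FFFD in the resulting frame. If the file's intended encoding is not UTF-8, pass it to open explicitly alongside errors='replace', e.g. open('foo.csv', encoding='cp1252', errors='replace').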
Source: https://stackoverflow.com/questions/60311784/is-there-any-way-for-pandas-read-csv-c-engine-to-ignore-or-replace-unicode-pars