Question
Most questions around reading strings from disk in Python involve codec issues. In contrast, I have a CSV file that just flat out has garbage data in it. Here's how to create an example:
b = bytearray(b'a,b,c\n1,2,qwe\n10,-20,asdf')
b[10] = 0xff
b[11] = 0xff
with open('foo.csv', 'wb') as fid:
    fid.write(b)
Note that the second row, third column now contains two 0xFF bytes, which are not valid UTF-8 (or any sensible encoding); it's just a small amount of garbage data.
When I try to read this with pandas.read_csv:
import pandas as pd
df = pd.read_csv('foo.csv') # fails
I get an error, naturally:
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
...
File "pandas/_libs/parsers.pyx", line 1520, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I can however successfully read this file if I use Pandas' Python CSV engine:
df2 = pd.read_csv('foo.csv', engine='python') # success
In this case, the invalid bytes are replaced with U+FFFD, the character Unicode reserves for replacing undecodable data.
Question: is there any way for Pandas' C CSV engine to do the same thing as Python's here?
Answer 1:
The replacement of invalid characters you see with the Python engine corresponds to the errors='replace' mode when decoding a bytes-like object.
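For reference, here is what that error mode does at the bytes level, using the garbage field from the example file (after overwriting bytes 10 and 11, the third field of the second row holds b'\xff\xffe'):
# each undecodable byte becomes one U+FFFD replacement character
b'\xff\xffe'.decode('utf-8', errors='replace')   # -> '\ufffd\ufffde'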
You may read the CSV using an arbitrary single-byte encoding and transcode the affected columns with this error mode (by passing a converter to read_csv or using the series.str.encode/decode methods, as in the sketch below), but it's quite cumbersome since you have to identify the specific set of columns.
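A minimal sketch of that per-column route, assuming the affected column is c (as it is in the example file): read the file with latin-1, which maps every byte to a code point and therefore cannot fail, then round-trip the column through UTF-8 with errors='replace'.
import pandas as pd

# latin-1 accepts any byte, so the read itself succeeds even on garbage data
df = pd.read_csv('foo.csv', encoding='latin-1')
# recover the raw bytes of the affected column, then re-decode as UTF-8,
# replacing anything invalid with U+FFFD
df['c'] = df['c'].str.encode('latin-1').str.decode('utf-8', errors='replace')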
For a global effect, since read_csv does not (yet) support an errors parameter, you can pre-open the file with the Python built-in open, which does support it:
df = pd.read_csv(open('foo.csv', errors='replace'))
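Because read_csv then receives an already-decoded text stream, the C engine never has to decode the raw bytes itself, and the garbage simply shows up as U+FFFD in the resulting frame. If the file's intended encoding is not UTF-8, pass it to open explicitly alongside errors='replace', e.g. open('foo.csv', encoding='cp1252', errors='replace').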
Source: https://stackoverflow.com/questions/60311784/is-there-any-way-for-pandas-read-csv-c-engine-to-ignore-or-replace-unicode-pars