Question
Solution:
As the answer below points out, the file was not encoded in CP1252 but in UTF-16. It is also tab-separated, not semicolon-separated. The working code is:
import pandas as pd
df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')
This also works with encoding='utf-16-le'.
Update: the first 3 lines of the file, read as bytes:
In : import itertools
...: print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))
Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
I'm working with CSV files whose raw form is:
The problem is that the file has two features that cause trouble together:
- the first row is not the header
- there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252
I'm using Python 3.X and pandas to deal with these files.
But when I try to read it with this code :
import pandas as pd
df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)
I get the following output (same with header=0):
In order to read the CSV correctly, I need to:
- get rid of the accent
- and ignore / delete the first row (which I don't need anyway).
How can I achieve that?
PS: I know I could write a VBA program or something similar for this, but I'd rather not. I'm interested in doing it inside my Python program, or in knowing for sure that it is not possible.
Answer 1:
CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.
The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF-8 file? UTF-16? It's definitely not CP1252.
Neither UTF-8 nor CP1252 would produce NaNs either. Any single-byte codepage would read the numeric digits at least, which means the file is saved in a multi-byte encoding.
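To illustrate the point above, here is a small sketch of what happens when UTF-16 bytes are decoded with a single-byte codepage. The sample string is just the header word from the question, and the little-endian byte order is an assumption chosen to match the byte dump in the question.

```python
import codecs

# Build a UTF-16LE sample with an explicit BOM.
# Assumption: little-endian, as in the question's byte dump.
raw = codecs.BOM_UTF16_LE + "Entrée".encode("utf-16-le")

# cp1252 maps every single byte to one character, so the BOM and the
# zero high bytes all show up as characters of their own.
wrong = raw.decode("cp1252")

# utf-16 uses the BOM to pick the byte order and then drops it.
right = raw.decode("utf-16")

print(repr(wrong))
print(repr(right))
```

The mis-decoded string starts with ÿþ and carries a NUL character after every letter, which is consistent with the garbled header seen in the question.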
The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ (bytes FF FE) is the BOM for UTF-16LE.
Try using utf-16 or utf-16-le instead of cp1252.
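If you would rather check the file than guess, one option is to look at its first bytes and pick the codec from the BOM. The helper below is only a sketch: `guess_encoding_from_bom` is a hypothetical name, not part of pandas or the standard library, and the cp1252 fallback is an assumption taken from the question.

```python
import codecs


def guess_encoding_from_bom(head: bytes, default: str = "cp1252") -> str:
    """Return a codec name based on a leading BOM, or `default` if none is found."""
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16LE BOM.
    # (A UTF-16LE file whose first character is NUL would be misread as UTF-32;
    # that case is ignored in this sketch.)
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: fall back to the caller's choice (an assumption, not a detection).
    return default


# First bytes of the file, copied from the byte dump in the question.
print(guess_encoding_from_bom(b'\xff\xfe"\x00D\x00'))
```

To use it on a real file, read the first four bytes in binary mode (for example `open('file_T.csv', 'rb').read(4)`) and pass the result as the `encoding` argument of `pd.read_csv`. A file without a BOM cannot be identified this way, so the fallback is only a guess.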
Source: https://stackoverflow.com/questions/56967744/python-pandas-how-to-read-a-csv-in-cp1252-with-a-first-row-to-delete