Python/Pandas : how to read a csv in cp1252 with a first row to delete?

Submitted by 别来无恙 on 2020-06-23 09:45:24

Question


Solution:

See the answer: the file was not encoded in CP1252 but in UTF-16. The solution code is:

import pandas as pd

df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')

It also works with encoding='utf-16-le'.


Update: output of the first 3 lines in bytes:

In : import itertools
...: print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))

Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']
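As a cross-check on the byte dump above, here is a minimal sketch using only the standard library (not pandas). The name sample_text is an assumption: it is the text of the three dumped lines, re-encoded here so the example runs without the original file. It suggests that the opening quote on the first line and the closing quote on the second line make those two physical lines a single CSV record, which would be consistent with header=1 selecting the real header row.

```python
import codecs
import csv
import io

# Text of the three physical lines shown in the byte dump (assumed sample,
# re-encoded here so the example does not need the original file).
sample_text = (
    '"Du mercredi 05 juin 2019 au mercredi 05 juin 2019\n'
    '"\t\t\t\t\t\t\t\t\t\n'
    'Code MCU\tImmatriculation\tDate\theure\tVitesse\tLatitude\t'
    'Longitude\tType\tEntrée\tEtat\n'
)
raw = codecs.BOM_UTF16_LE + sample_text.encode('utf-16-le')

# 'utf-16' reads the BOM to pick the byte order and then drops it.
decoded = raw.decode('utf-16')

# The quoted first field contains a newline, so lines 1-2 form one record.
rows = list(csv.reader(io.StringIO(decoded, newline=''), delimiter='\t'))
title_row, header_row = rows

print(len(rows))
print(header_row)
```

Under these assumptions the parser sees two records, not three: the title record (index 0) and the header record (index 1).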

I'm working with csv files whose raw form is:

The file has two features that cause trouble together:

  • the first row is not the header

  • there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252

I'm using Python 3.X and pandas to deal with these files.

But when I try to read it with this code:

import pandas as pd

df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding='cp1252')
print(df_T)

I get the following output (same with header=0):

To read the csv correctly, I need to:

  • get rid of the accent
  • and ignore / delete the first row (which I don't need anyway).

How can I achieve that?

PS: I know I could write a VBA program or something similar for this, but I'd rather not. I'm interested in including it in my Python program, or in knowing for sure that it is not possible.


Answer 1:


CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file was written in that codepage.

The image of the data you posted is just that - an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.

Neither UTF8 nor CP1252 would produce NaNs either. Any single-byte codepage would read the numeric digits at least, which means the file is saved in a multi-byte encoding.
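To illustrate the point about multi-byte encodings, here is a small sketch that decodes UTF-16LE bytes as if they were CP1252. The header string is an assumed sample built from two column names in the question, not the real file. Every ASCII character comes back followed by a NUL character, so the text no longer matches what a parser expects.

```python
# Assumed sample: two column names from the question, tab-separated.
header = 'Code MCU\tEntrée'
raw = header.encode('utf-16-le')

# Wrong codec: each 2-byte code unit is read as two separate characters.
as_cp1252 = raw.decode('cp1252')

# Right codec: the original text comes back unchanged.
as_utf16 = raw.decode('utf-16-le')

print(repr(as_cp1252))
print(repr(as_utf16))
```

Printing the repr makes the interleaved '\x00' characters visible. They are usually invisible when a DataFrame is printed normally.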

The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ (the bytes FF FE shown in a single-byte Latin codepage) is the BOM for UTF16LE.

Try using utf-16 or utf-16-le instead of cp1252.
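The BOM check can also be done in code instead of by eye. The sketch below is one possible approach: sniff_bom is a hypothetical helper, not part of pandas or the standard library. It relies only on the BOM constants in the codecs module and returns a codec name that could be passed as the encoding argument.

```python
import codecs


def sniff_bom(path):
    """Return a codec name guessed from the file's BOM, or None if absent.

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    with open(path, 'rb') as f:
        head = f.read(4)
    # UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    # starts with the same two bytes as the UTF-16LE BOM (FF FE).
    candidates = (
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    )
    for bom, name in candidates:
        if head.startswith(bom):
            return name
    return None
```

For a file like the one in the question, sniff_bom('file_T.csv') would be expected to return 'utf-16'. A return value of None only means that no BOM was found, not that the file is in any particular encoding.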



Source: https://stackoverflow.com/questions/56967744/python-pandas-how-to-read-a-csv-in-cp1252-with-a-first-row-to-delete
