Pandas seems to ignore first column name when reading tab-delimited data, gives KeyError

[愿得一人] 2021-01-04 17:37

I am using pandas 0.12.0 in ipython3 on Ubuntu 13.10, in order to wrangle large tab-delimited datasets in txt files. Using read_table to create a DataFrame from the txt app

4 answers
  •  野趣味 (original poster)
    2021-01-04 18:22

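    This seems to be related to a known issue; see GH #4793. The file starts with a UTF-8 byte order mark (BOM), and those three bytes end up glued to the front of the first column name, so looking up the plain name gives a KeyError. Using 'utf-8-sig' as the encoding seems to work. Without it, we have: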

    >>> df = pd.read_table("datafile.txt")
    >>> df.columns
    Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
    >>> df.columns[0]
    '\xef\xbb\xbfRECORDING_SESSION_LABEL'
    

    But with it, we have:

    >>> df = pd.read_table("datafile.txt", encoding="utf-8-sig")
    >>> df.columns
    Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
    >>> df.columns[0]
    u'RECORDING_SESSION_LABEL'
    >>> df["RECORDING_SESSION_LABEL"].max()
    u'73_1'
    

    (Used Python 2 for the above, but the same happens with Python 3.)
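
If you want to check the BOM explanation without pandas in the picture, here is a minimal standard-library sketch. The sample bytes and the `header_fields` helper are made up for illustration (they are not from the original data file), and this is Python 3, so the BOM shows up as the single character '\ufeff' rather than the three bytes '\xef\xbb\xbf':

```python
import codecs
import csv
import io

# Hypothetical sample data: a tab-delimited file saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + b"RECORDING_SESSION_LABEL\tLEFT_GAZE_X\n73_1\t512.3\n"


def header_fields(data, encoding):
    """Decode the bytes with the given encoding and return the header row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text, delimiter="\t"))


# Plain 'utf-8' keeps the BOM as part of the first field name.
plain = header_fields(raw, "utf-8")

# 'utf-8-sig' drops a leading BOM if one is present.
stripped = header_fields(raw, "utf-8-sig")

print(repr(plain[0]))
print(repr(stripped[0]))
```

The first name only matches "RECORDING_SESSION_LABEL" in the 'utf-8-sig' case, which is consistent with the KeyError going away once that encoding is passed to read_table.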
