Inconsistent pandas read_csv dtype inference on mostly-integer string column in huge TSV file

悲哀的现实 2020-12-19 10:03

I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files read_csv correctly interprets the column as strings, but with the huge TSV file the dtype inference becomes inconsistent.

2 Answers
  • 2020-12-19 10:24

    You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to output correctly no matter what you throw at it)... but yes, it could be a bug!
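
    For what it's worth, here is a rough, hypothetical reproduction of the inconsistency (the data is made up; the file name 'test' just matches the snippets below). As far as I know, with the default low_memory=True the C parser reads the file in chunks and infers a dtype per chunk, so a mostly-integer column with a stray string can come back as a mix of Python ints and strs inside a single object column:

    import numpy as np
    import pandas as pd

    # Build a large, mostly-integer column with a single non-numeric entry
    # near the top (sizes and values are made up for illustration).
    n = 2_000_000
    vals = np.arange(n).astype(str)
    vals[0] = 'id_0'
    pd.DataFrame({'a': vals}).to_csv('test', sep='\t', index=False)

    # With low_memory=True (the default) this typically emits a DtypeWarning
    # and leaves both str and int objects inside the single 'a' column.
    df = pd.read_csv('test', sep='\t')
    print(df['a'].map(type).value_counts())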

    As @Steven points out you can use the converters argument of read_csv:

    df2 = pd.read_csv('test', sep='\t', converters={'a': str})
    

    A lazy solution is just to patch this up after you've read in the file:

    In [11]: df2['a'] = df2['a'].astype('str')
    
    # now they are equal
    In [12]: pd.testing.assert_frame_equal(df, df2)
    

    Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both pickle and HDFStore are excellent solutions which won't be affected by these types of parsing bugs (and will be considerably faster). See: How to store data frame using PANDAS, Python
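
    In case it helps, a minimal sketch of those two storage options (the file names are placeholders, and HDFStore needs the optional PyTables dependency). Reusing df2 from above, both round-trips preserve the dtypes, so nothing gets re-parsed on reload:

    import pandas as pd

    # Pickle round-trip: dtypes come back exactly as stored.
    df2.to_pickle('frame.pkl')
    df_from_pickle = pd.read_pickle('frame.pkl')

    # HDF5 round-trip via HDFStore (requires the 'tables' package).
    with pd.HDFStore('frame.h5') as store:
        store['df2'] = df2
    df_from_hdf = pd.read_hdf('frame.h5', 'df2')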

  • 2020-12-19 10:27

    To avoid having Pandas infer your data type, provide a converters argument to read_csv:

    converters : dict, optional

    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

    For your file this would look like:

    df2 = pd.read_csv('test', sep='\t', converters={'a':str})
    

    My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
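
    A quick way to check that, using a tiny hypothetical two-column file (the column 'b' is invented here and isn't part of the original question): the converter only touches 'a', and 'b' still gets its dtype inferred.

    import pandas as pd

    # Hypothetical small TSV with a second, purely numeric column 'b'.
    with open('small_test.tsv', 'w') as f:
        f.write('a\tb\n1\t10\n2\t20\nx\t30\n')

    df2 = pd.read_csv('small_test.tsv', sep='\t', converters={'a': str})
    print(df2.dtypes)   # a -> object (strings), b -> int64 (still inferred)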
