Pandas dataframe read_csv on bad data

Backend · Unresolved · 3 answers · 1,939 views

庸人自扰 2020-12-02 22:13

I want to read in a very large csv (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash.

3 Answers
  •  清歌不尽
    2020-12-02 22:19

    To get information about the rows causing errors, try the combination of error_bad_lines=False and warn_bad_lines=True:

    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                            warn_bad_lines=True, error_bad_lines=False)
    

    error_bad_lines=False skips the error-causing rows, while warn_bad_lines=True prints the error details and row numbers, like this:

    'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'
    

    If you want to save the warning messages (e.g. for further processing), you can redirect them to a file using contextlib (pandas writes these warnings to stderr):

    import contextlib

    import pandas as pd

    with open(r'D:\Temp\log.txt', 'w') as log:
        with contextlib.redirect_stderr(log):
            dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                                    warn_bad_lines=True, error_bad_lines=False)
    
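    Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0 in favor of the single on_bad_lines parameter, which accepts 'error', 'warn', 'skip', or (with the Python engine) a callable invoked once per malformed row. A minimal sketch of the callable form, using a small in-memory CSV as assumed sample data:

    ```python
    import io

    import pandas as pd

    # Assumed sample data: the third data row has one extra column.
    csv_data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

    bad_rows = []

    def log_bad_line(fields):
        # Called once per malformed row with the already-split fields;
        # returning None drops the row from the resulting DataFrame.
        bad_rows.append(fields)
        return None

    # A callable for on_bad_lines requires the Python parsing engine.
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines=log_bad_line,
                     engine="python")

    print(df)        # two well-formed rows survive
    print(bad_rows)  # the malformed row, captured for later inspection
    ```

    This replaces the stderr-redirection trick above on modern pandas: the bad rows land in a plain Python list instead of a log file, so no text parsing is needed afterwards.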
