Pandas dataframe read_csv on bad data

Backend · Unresolved · 3 answers · 1,939 views

庸人自扰 2020-12-02 22:13

I want to read in a very large csv (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash.

3 Answers
  •  清歌不尽
    2020-12-02 22:19

    To get information about the rows causing errors, try the combination of error_bad_lines=False and warn_bad_lines=True:

    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                            warn_bad_lines=True, error_bad_lines=False)
    

    error_bad_lines=False skips the error-causing rows, while warn_bad_lines=True prints the error details and row numbers, like this:

    'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'
    

    If you want to save the warning messages (e.g. for further processing), you can redirect them to a file using contextlib (pandas writes these warnings to stderr):

    import contextlib

    import pandas as pd

    with open(r'D:\Temp\log.txt', 'w') as log:
        with contextlib.redirect_stderr(log):
            dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                                    warn_bad_lines=True, error_bad_lines=False)
    
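    Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0 in favor of the single on_bad_lines parameter, which accepts 'error', 'warn', 'skip', or (with the Python engine) a callable invoked once per malformed row. A minimal sketch of the callable form, using a small in-memory CSV as assumed sample data:

    ```python
    import io

    import pandas as pd

    # Assumed sample data: the third data row has one extra column.
    csv_data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

    bad_rows = []

    def log_bad_line(fields):
        # Called once per malformed row with the already-split fields;
        # returning None drops the row from the resulting DataFrame.
        bad_rows.append(fields)
        return None

    # A callable for on_bad_lines requires the Python parsing engine.
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines=log_bad_line,
                     engine="python")

    print(df)        # two well-formed rows survive
    print(bad_rows)  # the malformed row, captured for later inspection
    ```

    This replaces the stderr-redirection trick above on modern pandas: the bad rows land in a plain Python list instead of a log file, so no text parsing is needed afterwards.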
