Read CSV into a DataFrame with varying row lengths using Pandas

Question

So I have a CSV that looks a bit like this:

1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...

When I try to use the following code to generate a DataFrame:

df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)

it only adds the rows with 3 columns to the df (rows 1, 3 and 5 from above).

The rest are considered 'bad lines' and are skipped, with messages like the following:

Skipping line 17467: expected 3 fields, saw 9

How do I create a data frame that includes all data in my csv, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?

Thanks!

Answer 1:

If using only pandas, read each line in as a single field, then deal with the separator afterwards.

import pandas as pd

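# Read each line as a single field, then split on the pipe afterwards.
df = pd.read_csv('data.csv', header=None, sep='\n')
# Raw string so the regex escapes are not treated as string escapes.
df = df[0].str.split(r'\s\|\s', expand=True)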

   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None
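
Newer pandas releases may reject sep='\n' with a ValueError, so the snippet above might not run as written on every version. Here is a minimal sketch of the same idea, assuming the lines are read with plain Python first and only the splitting is left to pandas. The helper name read_ragged and the inline sample are hypothetical, used only for illustration:

```python
import io

import pandas as pd

# Inline sample standing in for data.csv (illustrative only).
sample = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""


def read_ragged(handle, pattern=r"\s*\|\s*"):
    """Read one line per row, then split each row on the pipe separator."""
    # Skip blank lines and drop the trailing newline of each row.
    lines = [line.strip() for line in handle if line.strip()]
    # expand=True spreads the pieces over columns; shorter rows are
    # padded with missing values on the right.
    return pd.Series(lines).str.split(pattern, expand=True)


df = read_ragged(io.StringIO(sample))
print(df)
```

For a real file, an open file object can be passed instead of the StringIO, for example inside `with open('data.csv') as f:`. All values come back as strings, so numeric columns would still need converting.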

Answer 2:

If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

If you only have an upper limit, N, on the number of columns, then you can have Pandas read N columns and then use dropna to drop the completely empty columns:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)

yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
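
If no upper limit is known in advance, one option is to scan the data once to find the widest row and pass that count to names. This is a rough sketch, assuming no field contains a quoted '|' (a plain delimiter count would miscount those). The helper max_field_count and the inline sample are hypothetical names for illustration:

```python
import io

import pandas as pd

# Inline sample standing in for the 'data' file (illustrative only).
sample = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""


def max_field_count(text, delimiter="|"):
    """Return the largest number of fields found on any non-blank line."""
    return max(
        line.count(delimiter) + 1
        for line in text.splitlines()
        if line.strip()
    )


n_cols = max_field_count(sample)
# Declaring exactly n_cols names means no all-empty columns are created,
# so the dropna step is not needed.
df = pd.read_csv(io.StringIO(sample), delimiter="|", names=list(range(n_cols)))
print(df)
```

The trade-off is a second pass over the file, which avoids guessing N and avoids dropping empty columns from the middle of the data.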

Answer 3:

Add extra columns (empty or otherwise) to the header row at the top of your CSV file. Pandas takes the first row as the default width, and any shorter row below it is padded with NaN values. Example:

file.csv:

a,b,c,d,e
1,2,3
3
2,3,4

code:

>>> import pandas as pd
>>> pd.read_csv('file.csv')
   a    b    c   d   e
0  1  2.0  3.0 NaN NaN
1  3  NaN  NaN NaN NaN
2  2  3.0  4.0 NaN NaN

Answer 4:

Reading the file as fixed-width with read_fwf should work:

import pandas as pd
from io import StringIO

s = '''1  01-01-2019  724
2  01-01-2019  233  436
3  01-01-2019  345
4  01-01-2019  803  933  943  923  954
5  01-01-2019  454'''


pd.read_fwf(StringIO(s), header=None)

   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN

Or with the delimiter parameter:

s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''


pd.read_fwf(StringIO(s), header=None, delimiter='|')

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN

Note that for your actual file you would not use StringIO; just pass your file path instead: pd.read_fwf('data.csv', delimiter='|', header=None)

Answer 5:

Consider using Python's csv module to do the heavy lifting of importing the data and grooming its format. You can register a custom dialect to handle variations in the CSV format.

import csv
import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

with open('test1.csv', 'w') as f:
    f.write(csv_data)

csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data = data)

This gives you a CSV import dialect and the following DataFrame:

    0             1      2      3      4      5     6
0  1    01-01-2019     724   None   None   None  None
1  2    01-01-2019    233     436   None   None  None
2  3    01-01-2019     345   None   None   None  None
3  4    01-01-2019    803    933    943    923    954
4  5    01-01-2019     454   None   None   None  None

Handling the whitespace padding in the input file is left as an exercise.
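
One possible way to handle that padding is to strip each cell as the rows are read. This is only a sketch of one approach, reading from an inline sample rather than test1.csv, and it passes delimiter='|' directly instead of registering a dialect:

```python
import csv
import io

import pandas as pd

# Inline sample standing in for test1.csv (illustrative only).
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

reader = csv.reader(io.StringIO(csv_data), delimiter="|")
# Strip the spaces left on both sides of every field.
rows = [[cell.strip() for cell in row] for row in reader]
# Rows of unequal length are padded with None on the right.
df = pd.DataFrame(rows)
print(df)
```

csv.reader also accepts skipinitialspace=True, but that only removes whitespace after the delimiter, so the trailing spaces before each '|' would remain.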

Answer 6:

import pandas as pd

colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)

Change the 9 in colnames to x if the code gives an error of this form:

Skipping line 17467: expected 3 fields, saw x

Source: https://stackoverflow.com/questions/55129640/read-csv-into-a-dataframe-with-varying-row-lengths-using-pandas

Tags

python

pandas

csv

dataframe