Read CSV into a DataFrame with varying row lengths using Pandas

孤城傲影 2020-12-03 22:32

So I have a CSV that looks a bit like this:

1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954         


        
6 answers
  • 2020-12-03 23:14

    Read fixed width should work:

    import pandas as pd
    from io import StringIO
    
    s = '''1  01-01-2019  724
    2  01-01-2019  233  436
    3  01-01-2019  345
    4  01-01-2019  803  933  943  923  954
    5  01-01-2019  454'''
    
    
    pd.read_fwf(StringIO(s), header=None)
    
       0           1    2      3      4      5      6
    0  1  01-01-2019  724    NaN    NaN    NaN    NaN
    1  2  01-01-2019  233  436.0    NaN    NaN    NaN
    2  3  01-01-2019  345    NaN    NaN    NaN    NaN
    3  4  01-01-2019  803  933.0  943.0  923.0  954.0
    4  5  01-01-2019  454    NaN    NaN    NaN    NaN
    

    or with a delimiter param

    s = '''1 | 01-01-2019 | 724
    2 | 01-01-2019 | 233 | 436
    3 | 01-01-2019 | 345
    4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
    5 | 01-01-2019 | 454'''
    
    
    pd.read_fwf(StringIO(s), header=None, delimiter='|')
    
       0             1    2      3      4      5      6
    0  1   01-01-2019   724    NaN    NaN    NaN    NaN
    1  2   01-01-2019   233  436.0    NaN    NaN    NaN
    2  3   01-01-2019   345    NaN    NaN    NaN    NaN
    3  4   01-01-2019   803  933.0  943.0  923.0  954.0
    4  5   01-01-2019   454    NaN    NaN    NaN    NaN
    

    Note that for your actual file you would not use StringIO; pass the file path instead: pd.read_fwf('data.csv', delimiter='|', header=None)

  • 2020-12-03 23:16

    Consider letting Python's csv module do the heavy lifting for importing and cleaning up the data. You can register a custom dialect to handle a nonstandard delimiter.

    import csv
    import pandas as pd
    
    csv_data = """1 | 01-01-2019 | 724
    2 | 01-01-2019 | 233 | 436
    3 | 01-01-2019 | 345
    4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
    5 | 01-01-2019 | 454"""
    
    with open('test1.csv', 'w') as f:
        f.write(csv_data)
    
    csv.register_dialect('PipeDialect', delimiter='|')
    with open('test1.csv') as csvfile:
        data = [row for row in csv.reader(csvfile, 'PipeDialect')]
    df = pd.DataFrame(data = data)
    

    Gives you a csv import dialect and the following DataFrame:

        0             1      2      3      4      5     6
    0  1    01-01-2019     724   None   None   None  None
    1  2    01-01-2019    233     436   None   None  None
    2  3    01-01-2019     345   None   None   None  None
    3  4    01-01-2019    803    933    943    923    954
    4  5    01-01-2019     454   None   None   None  None
    

    Left as an exercise is handling the whitespace padding in the input file.
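    One way to handle that padding, as a minimal sketch: strip each field while the rows are read, before building the DataFrame. StringIO stands in for the real file here, and the variable names are only illustrative.

```python
import csv
from io import StringIO

import pandas as pd

# Sample data from the question; StringIO stands in for the real file.
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

reader = csv.reader(StringIO(csv_data), delimiter='|')

# Strip the whitespace around every field as the rows are read.
rows = [[field.strip() for field in row] for row in reader]

# The DataFrame constructor fills the short rows with missing values.
df = pd.DataFrame(rows)
print(df)
```

    All values are still strings at this point; convert individual columns afterwards if you need numbers or dates.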

  • 2020-12-03 23:23

    If you want to use only pandas, read each line in as a single string first and split on the separator afterwards.

    import pandas as pd
    
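    # Note: recent pandas versions may reject sep='\n'; if so, read the
    # lines yourself (e.g. with open()) and build a Series from them.
    df = pd.read_csv('data.csv', header=None, sep='\n')
    # Raw string so the regex escapes are not treated as string escapes.
    df = df[0].str.split(r'\s\|\s', expand=True)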
    
       0           1    2     3     4     5     6
    0  1  01-01-2019  724  None  None  None  None
    1  2  01-01-2019  233   436  None  None  None
    2  3  01-01-2019  345  None  None  None  None
    3  4  01-01-2019  803   933   943   923   954
    4  5  01-01-2019  454  None  None  None  None
    
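    If read_csv refuses a newline separator in your pandas version, the same idea can be sketched without it: read the lines yourself, put them in a Series, and split. This assumes the file fits in memory; StringIO stands in for open('data.csv').

```python
from io import StringIO

import pandas as pd

# Stand-in for open('data.csv'); the contents are the question's sample.
data = StringIO("""1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454""")

# One string per line, with leading and trailing whitespace removed.
lines = pd.Series(data.read().splitlines()).str.strip()

# Split on the pipe together with any whitespace around it.
df = lines.str.split(r'\s*\|\s*', expand=True)
print(df)
```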
  • 2020-12-03 23:25

    Add extra columns (empty or otherwise) to the top of your CSV file. Pandas takes the first row as the default width, and any shorter row below it is filled with NaN values. Example:

    file.csv:

    a,b,c,d,e
    1,2,3
    3
    2,3,4
    

    code:

    >>> import pandas as pd
    >>> pd.read_csv('file.csv')
       a    b    c   d   e
    0  1  2.0  3.0 NaN NaN
    1  3  NaN  NaN NaN NaN
    2  2  3.0  4.0 NaN NaN
    
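    If you would rather not edit the file on disk, the header line can be prepended in memory instead. This is only a sketch: the header names and the inline file contents are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents without a header line.
raw = "1,2,3\n3\n2,3,4\n"

# Header that is at least as wide as the widest row.
header = "a,b,c,d,e\n"

# Parse header + data together, so the header sets the column count.
df = pd.read_csv(StringIO(header + raw))
print(df)
```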
  • 2020-12-03 23:28

    If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:

    import pandas as pd
    df = pd.read_csv('data', delimiter='|', names=list(range(7)))
    print(df)
    

    yields

       0             1    2      3      4      5      6
    0  1   01-01-2019   724    NaN    NaN    NaN    NaN
    1  2   01-01-2019   233  436.0    NaN    NaN    NaN
    2  3   01-01-2019   345    NaN    NaN    NaN    NaN
    3  4   01-01-2019   803  933.0  943.0  923.0  954.0
    4  5   01-01-2019   454    NaN    NaN    NaN    NaN
    

    If you only know an upper limit, N, on the number of columns, you can have Pandas read N columns and then use dropna to drop the completely empty ones:

    import pandas as pd
    df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
    print(df)
    

    yields

       0             1    2      3      4      5      6
    0  1   01-01-2019   724    NaN    NaN    NaN    NaN
    1  2   01-01-2019   233  436.0    NaN    NaN    NaN
    2  3   01-01-2019   345    NaN    NaN    NaN    NaN
    3  4   01-01-2019   803  933.0  943.0  923.0  954.0
    4  5   01-01-2019   454    NaN    NaN    NaN    NaN
    

    Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
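    If you know neither N nor an upper limit, one option is a first pass over the file that counts delimiters to find the widest row. A minimal sketch, assuming the delimiter never appears inside a field; StringIO and the variable names are stand-ins for illustration.

```python
from io import StringIO

import pandas as pd

# Stand-in for the contents of the real file.
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

# First pass: the widest row has (max delimiter count + 1) fields.
n_cols = max(line.count('|') for line in raw.splitlines()) + 1

# Second pass: tell pandas how many columns to expect.
df = pd.read_csv(StringIO(raw), delimiter='|', names=list(range(n_cols)))
print(df)
```

    Because no extra columns are requested, nothing has to be dropped afterwards, so empty columns in the middle of the data are preserved.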

  • 2020-12-03 23:29
    import pandas as pd

    # One name per expected column; 9 is the assumed maximum row width.
    colnames = [str(i) for i in range(9)]
    df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
    

    If the code reports the error below, change the 9 in colnames to x, the number of fields pandas actually saw:

    Skipping line 17467: expected 3 fields, saw x
    