Pandas read_csv expects wrong number of columns, with ragged csv file

佛祖请我去吃肉 2020-12-08 14:34

I have a csv file that has a few hundred rows and 26 columns, but the last few columns only have a value in a few rows and they are towards the middle or end of the file. Wh

4 Answers
  • 2020-12-08 15:21

    You can use the names parameter. For example, if you have a CSV file like this:

    1,2,1
    2,3,4,2,3
    1,2,3,3
    1,2,3,4,5,6
    

    If you try to read it, you'll get an error:

    >>> pd.read_csv(r'D:/Temp/tt.csv')
    Traceback (most recent call last):
    ...
    Expected 5 fields in line 4, saw 6
    

    But if you pass the names parameter, you get a result:

    >>> pd.read_csv(r'D:/Temp/tt.csv', names=list('abcdef'))
       a  b  c   d   e   f
    0  1  2  1 NaN NaN NaN
    1  2  3  4   2   3 NaN
    2  1  2  3   3 NaN NaN
    3  1  2  3   4   5   6
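
    A related detail: the padded cells are NaN, so pandas stores those columns as floats. The sketch below reproduces the example from an in-memory string and, as an optional variation that is not part of the answer above, requests pandas' nullable "Int64" dtype so the values stay integers. It assumes every field in the file is an integer.

```python
import io

import pandas as pd

# Same ragged data as above, held in memory instead of D:/Temp/tt.csv.
raw = "1,2,1\n2,3,4,2,3\n1,2,3,3\n1,2,3,4,5,6\n"

# names must list at least as many columns as the widest row (6 here).
# dtype="Int64" is an assumption for illustration: it keeps integers
# and marks the padded cells as <NA> instead of NaN.
df = pd.read_csv(io.StringIO(raw), header=None, names=list("abcdef"), dtype="Int64")

print(df)
```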
    

    Hope it helps.

  • 2020-12-08 15:21

    Suppose you have a file like this:

    a,b,c
    1,2,3
    1,2,3,4
    

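    You could use csv.reader to clean the file first:

    import csv

    with open('file.csv', newline='') as f:
        lines = list(csv.reader(f))
    header, values = lines[0], lines[1:]
    # zip(*values) transposes rows into columns and stops at the shortest row
    data = {h: v for h, v in zip(header, zip(*values))}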
    

    and get:

    {'a' : ('1','1'), 'b': ('2','2'), 'c': ('3', '3')}
    

    If the file has no header, you could use:

    # with no header every row is data, i.e. values = lines
    data = {h: v for h, v in zip(map(str, range(number_of_columns)), zip(*values))}
    

    and then you can convert the dictionary to a DataFrame with

    import pandas as pd
    df = pd.DataFrame.from_dict(data)
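
    Note that zip(*values) stops at the shortest row, so the extra 4 in the sample file is dropped. If the extra fields should be kept, one possible variation (not part of the answer above) pads the short rows instead. In this sketch the function name read_ragged and the extra_N column names are made up for illustration, and all values stay strings, as with csv.reader.

```python
import csv
import io

import pandas as pd


def read_ragged(handle, has_header=True):
    """Read a ragged CSV from an open text handle, padding short rows with None."""
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    header, values = (rows[0], rows[1:]) if has_header else ([], rows)
    width = max(len(row) for row in values + [header])
    # Columns beyond the header get hypothetical names extra_0, extra_1, ...
    columns = list(header) + [f"extra_{i}" for i in range(width - len(header))]
    padded = [row + [None] * (width - len(row)) for row in values]
    return pd.DataFrame(padded, columns=columns)


# The sample file from the answer, held in memory.
sample = io.StringIO("a,b,c\n1,2,3\n1,2,3,4\n")
df = read_ragged(sample)
print(df)
```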
    
  • 2020-12-08 15:25

    The problem with the given solution is that you have to know the maximum number of columns in advance. I couldn't find a built-in function for this, but you can write one that will:

    1. read all the lines
    2. split each line
    3. count the number of words/elements in each row
    4. store the max number of words/elements
    5. pass that max value to the names option (as suggested by Roman Pekar)

    Here is the function I wrote for my (space-separated) files:

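    import pandas as pd

    def ragged_csv(filename, sep=' '):
        # first pass: find the largest number of fields on any line
        with open(filename) as f:
            max_n = max(len(line.split(sep)) for line in f)
        # second pass: give pandas that many column names so short rows are padded with NaN
        return pd.read_csv(filename, sep=sep, names=list(range(max_n)))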
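
    One caveat with counting fields via line.split: a quoted field that contains the separator is counted more than once. A hedged variant of the same two-pass idea uses csv.reader for the counting pass. The function name read_ragged_csv and the comma-separated sample data are assumptions for illustration.

```python
import csv
import os
import tempfile

import pandas as pd


def read_ragged_csv(path, sep=","):
    """Two-pass read: find the widest row, then let pandas pad the rest."""
    with open(path, newline="") as f:
        # csv.reader respects quoting, so a quoted field that contains
        # the separator is counted as one field.
        max_n = max((len(row) for row in csv.reader(f, delimiter=sep)), default=0)
    return pd.read_csv(path, sep=sep, header=None, names=list(range(max_n)))


# Write a small ragged sample to a temporary file; note the quoted "3,5".
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write('1,2,1\n2,"3,5",4,2,3\n1,2,3,3\n1,2,3,4,5,6\n')

df = read_ragged_csv(tmp.name)
os.remove(tmp.name)
print(df)
```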
    
  • 2020-12-08 15:29

    You can also load the CSV with a separator that does not occur in the data, such as '^', so that each whole line lands in a single column, then use split to break that column on the real delimiter. After that, concat the result with the original DataFrame (if needed).

    # names=['X0'] labels the single column; the older prefix='X' argument is not available in pandas 2.0+
    temp = pd.read_csv('test.csv', sep='^', header=None, names=['X0'])
    temp2 = temp.X0.str.split(',', expand=True)
    del temp['X0']
    temp = pd.concat([temp, temp2], axis=1)
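
    Keep in mind that str.split returns strings, so the resulting columns usually need converting afterwards. A minimal sketch on in-memory data, assuming every field is numeric and '^' never appears in the file; the column name line is arbitrary.

```python
import io

import pandas as pd

raw = io.StringIO("1,2,1\n2,3,4,2,3\n1,2,3,3\n1,2,3,4,5,6\n")

# '^' does not occur in the data, so each row is read as one string.
lines = pd.read_csv(raw, sep="^", header=None, names=["line"])

# Split on the real delimiter; short rows are padded with missing values.
wide = lines["line"].str.split(",", expand=True)

# The pieces are strings, so convert each column to a numeric dtype.
wide = wide.apply(pd.to_numeric)
print(wide)
```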
    