Read CSV into a dataFrame with varying row lengths using Pandas

前端未结

关注

 6  1754

So I have a CSV that looks a bit like this:

1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  天命终不由人        
                
              
                            
                2020-12-03 23:14
              
            
            
                                                                       
Read fixed width should work:

from io import StringIO

s = '''1  01-01-2019  724
2  01-01-2019  233  436
3  01-01-2019  345
4  01-01-2019  803  933  943  923  954
5  01-01-2019  454'''


pd.read_fwf(StringIO(s), header=None)

   0           1    2      3      4      5      6
0  1  01-01-2019  724    NaN    NaN    NaN    NaN
1  2  01-01-2019  233  436.0    NaN    NaN    NaN
2  3  01-01-2019  345    NaN    NaN    NaN    NaN
3  4  01-01-2019  803  933.0  943.0  923.0  954.0
4  5  01-01-2019  454    NaN    NaN    NaN    NaN


or with a delimiter param

s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''


pd.read_fwf(StringIO(s), header=None, delimiter='|')

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN


note that for your actual file you will not use StringIO you would just replace that with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  名媛妹妹        
                
              
                            
                2020-12-03 23:16
              
            
            
                                                                       
Consider using Python csv to do the lifting for importing data and format grooming.  You can implement a custom dialect to handle varying csv-ness.

import csv
import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

with open('test1.csv', 'w') as f:
    f.write(csv_data)

csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data = data)


Gives you a csv import dialect and the following DataFrame:

    0             1      2      3      4      5     6
0  1    01-01-2019     724   None   None   None  None
1  2    01-01-2019    233     436   None   None  None
2  3    01-01-2019     345   None   None   None  None
3  4    01-01-2019    803    933    943    923    954
4  5    01-01-2019     454   None   None   None  None


Left as an exercise is handling the whitespace padding in the input file.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  自闭症患者        
                
              
                            
                2020-12-03 23:23
              
            
            
                                                                       
If using only pandas, read in lines, deal with the separator after. 

import pandas as pd

df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split('\s\|\s', expand=True)

   0           1    2     3     4     5     6
0  1  01-01-2019  724  None  None  None  None
1  2  01-01-2019  233   436  None  None  None
2  3  01-01-2019  345  None  None  None  None
3  4  01-01-2019  803   933   943   923   954
4  5  01-01-2019  454  None  None  None  None

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  轻奢々        
                
              
                            
                2020-12-03 23:25
              
            
            
                                                                       
add extra columns (empty or otherwise) to the top of your csv file. Pandas will takes the first row as the default size, and anything below it will have NaN values. Example:

file.csv:

a,b,c,d,e
1,2,3
3
2,3,4


code:

>>> import pandas as pd
>>> pd.read_csv('file.csv')
   a    b    c   d   e
0  1  2.0  3.0 NaN NaN
1  3  NaN  NaN NaN NaN
2  2  3.0  4.0 NaN NaN

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暗喜        
                
              
                            
                2020-12-03 23:28
              
            
            
                                                                       
If you know that the data contains N columns, you can 
tell Pandas in advance how many columns to expect via the names parameter:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)


yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN


If you have an the upper limit, N, on the number of columns, then you can
have Pandas read N columns and then use dropna to drop completely empty columns:

import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)


yields

   0             1    2      3      4      5      6
0  1   01-01-2019   724    NaN    NaN    NaN    NaN
1  2   01-01-2019   233  436.0    NaN    NaN    NaN
2  3   01-01-2019   345    NaN    NaN    NaN    NaN
3  4   01-01-2019   803  933.0  943.0  923.0  954.0
4  5   01-01-2019   454    NaN    NaN    NaN    NaN


Note that this could drop columns from the middle of the data set (not just
columns from the right-hand side) if they are completely empty.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  轮回少年        
                
              
                            
                2020-12-03 23:29
              
            
            
                                                                       
colnames= [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)


Change 9 in colnames to number x if code gives the error

Skipping line 17467: expected 3 fields, saw x

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复