Python pandas: read csv with multiple tables repeated preamble

再見小時候 2021-01-13 14:34

Is there a pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain trash and then get the headers/values rows into data frames?

2 Answers
  •  盖世英雄少女心
    2021-01-13 15:20

    Yes, there is a more Pythonic way to do this with pandas. Here is a quick demonstration:

    import pandas as pd
    from io import StringIO
    
    #define an example to showcase the solution
    st = """blah blah here's a test and
    here's some information  
    you don't care about  
    even a little bit  
    header1, header2, header3  
    1, 2, 3  
    4, 5, 6  
    
    oh you have another test  
    here's some more garbage  
    that's different than the last one  
    this should make  
    life interesting  
    header1, header2, header3  
    7, 8, 9  
    10, 11, 12  
    13, 14, 15""" 
    
    # 1- read the data with pd.read_csv
    # 2- skip malformed lines (too many fields) with on_bad_lines="skip" (pandas >= 1.3; older versions used error_bad_lines=False)
    # 3- the header is not the first row of the file, so define the column names manually with names=[...] and set header=None
    data = pd.read_csv(StringIO(st), delimiter=",", names=["header1", "header2", "header3"], on_bad_lines="skip", header=None)
    
    # the trash will be loaded as follows 
    # blah blah here's a test and       NaN         NaN
    # let's drop these rows 
    data = data.dropna()
    
    # remove the repeated header rows, i.e. rows whose first field contains the text "header"
    mask = data["header1"].str.contains("header", regex=False)
    data = data[~mask]
    print(data)
    

    Now your DataFrame looks like this (the original row numbers are kept as the index, and the values are still strings):

       header1 header2 header3
    5        1       2     3  
    6        4       5     6  
    13       7       8     9  
    14      10      11    12  
    15      13      14      15
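
The answer above merges every table into one DataFrame of strings. If you want one numeric DataFrame per table instead, the following is a minimal sketch of one possible extension, not part of the original answer. It assumes every table repeats the same header row and that no preamble line happens to contain exactly two commas. The sample text and the names `raw`, `is_header`, `table_id` and `tables` are made up for illustration; `skipinitialspace` is a standard `read_csv` option that drops the space after each comma.

```python
from io import StringIO

import pandas as pd

# Shortened, hypothetical stand-in for the file in the answer.
raw = """some preamble text
more text you do not need
header1, header2, header3
1, 2, 3
4, 5, 6

a second preamble
header1, header2, header3
7, 8, 9
10, 11, 12
"""

columns = ["header1", "header2", "header3"]

# Read everything as three columns; short preamble lines are padded with NaN.
df = pd.read_csv(
    StringIO(raw),
    names=columns,
    header=None,
    skipinitialspace=True,
)

# Each repeated header row starts a new table, so a running count of
# header rows gives every row the number of the table it belongs to.
is_header = df["header1"].eq("header1")
table_id = is_header.cumsum()

# Data rows are complete (no NaN) and are not header rows themselves.
is_data = df.notna().all(axis=1) & ~is_header

# Build one DataFrame per table and convert the string values to numbers.
tables = [
    group.apply(pd.to_numeric).reset_index(drop=True)
    for _, group in df[is_data].groupby(table_id[is_data])
]

for table in tables:
    print(table, end="\n\n")
```

Grouping on the running header count keeps the tables apart even when the preambles differ in length, which is the situation described in the question.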
    
