Python pandas: read csv with multiple tables repeated preamble

再見小時候 2021-01-13 14:34

Is there a pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain trash and then get the headers/values rows into data frames?

2 answers

盖世英雄少女心 (original poster)

2021-01-13 15:20

Yes, there is a more Pythonic way to do this with pandas. Here is a quick demonstration that answers the question:

import pandas as pd
from io import StringIO  # Python 3 location; the standalone StringIO module was Python 2 only

# Define an example to showcase the solution
st = """blah blah here's a test and
here's some information  
you don't care about  
even a little bit  
header1, header2, header3  
1, 2, 3  
4, 5, 6  

oh you have another test  
here's some more garbage  
that's different than the last one  
this should make  
life interesting  
header1, header2, header3  
7, 8, 9  
10, 11, 12  
13, 14, 15""" 

# 1. Read the data with pd.read_csv.
# 2. Skip malformed lines (rows with too many fields) with on_bad_lines="skip".
#    This needs pandas >= 1.3; older versions used error_bad_lines=False instead.
# 3. read_csv expects the header on the first row. That is not the case here,
#    so define the column names manually with names=[...] and set header=None.
data = pd.read_csv(
    StringIO(st),
    delimiter=",",
    names=["header1", "header2", "header3"],
    on_bad_lines="skip",
    header=None,
)

# The trash lines contain no commas, so they load with NaN in the other columns:
# blah blah here's a test and       NaN         NaN
# Drop these rows.
data = data.dropna()

# Remove the repeated header rows, i.e. rows whose first field starts with "header"
mask = data["header1"].str.startswith("header")
data = data[~mask]
print(data)

Now your DataFrame looks like this:

   header1 header2 header3
5        1       2     3  
6        4       5     6  
13       7       8     9  
14      10      11    12  
15      13      14      15
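
The result above merges both tables into one DataFrame and leaves the values as strings with stray spaces. If you want one DataFrame per table, as the question asks, a minimal sketch along the following lines might work. It assumes every table begins with the same header row; the sample text and the names `raw`, `is_header`, `table_id` and `tables` are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample input, shaped like the example above.
raw = """some preamble
more preamble
header1, header2, header3
1, 2, 3
4, 5, 6

another preamble
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15"""

names = ["header1", "header2", "header3"]
df = pd.read_csv(StringIO(raw), names=names, header=None)

# Preamble lines have no commas, so their other columns are NaN; drop them.
df = df.dropna()

# Each repeated header row marks the start of a new table.
is_header = df["header1"].str.strip() == "header1"
table_id = is_header.cumsum().rename("table_id")

# Group the remaining data rows by table, strip whitespace, convert to numbers.
body = df[~is_header]
tables = [
    group.apply(lambda col: pd.to_numeric(col.str.strip())).reset_index(drop=True)
    for _, group in body.groupby(table_id[~is_header])
]

for table in tables:
    print(table)
```

The cumulative sum over the header mask gives every row the number of the most recent header above it, so grouping by that number separates the tables without hard-coding row positions.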
