What should I do when a table row has rowspan?

忘掉有多难 2020-12-18 06:17

If a row contains cells with a rowspan attribute, how can I make the parsed rows line up with the table as it is displayed on the Wikipedia page?

from bs4 import BeautifulSoup
import urllib2
from lxm         


        
4 answers
  •  执笔经年
    2020-12-18 06:49

    None of the parsers I found on Stack Overflow or elsewhere on the web worked for me: they all parsed my Wikipedia tables incorrectly. So here you go, a simple parser that actually works. Cheers.

    Define the parser functions:

    import numpy as np
    import pandas as pd

    def pre_process_table(table):
        """
        INPUT:
            1. table - a bs4 element that contains the desired table: ie  ...
        OUTPUT:
            a tuple of:
                1. rows - a list of table rows ie: list of ... elements
                2. num_rows - number of rows in the table
                3. num_cols - number of columns in the table
        Options:
            include_td_head_count - whether to use only th or th and td to count number of columns (default: False)
        """
        rows = [x for x in table.find_all('tr')]

        num_rows = len(rows)

        # get an initial column count. Most often, this will be accurate
        num_cols = max([len(x.find_all(['th','td'])) for x in rows])

        # sometimes, the tables also contain multi-colspan headers. This accounts for that:
        header_rows_set = [x.find_all(['th', 'td']) for x in rows if len(x.find_all(['th', 'td']))>num_cols/2]

        num_cols_set = []

        for header_rows in header_rows_set:
            num_cols = 0
            for cell in header_rows:
                row_span, col_span = get_spans(cell)
                num_cols+=len([cell.getText()]*col_span)

            num_cols_set.append(num_cols)

        num_cols = max(num_cols_set)

        return (rows, num_rows, num_cols)

    def get_spans(cell):
        """
        INPUT:
            1. cell - a ... or ... element that contains a table cell entry
        OUTPUT:
            1. a tuple with the cell's row and col spans
        """
        if cell.has_attr('rowspan'):
            rep_row = int(cell.attrs['rowspan'])
        else: # ~cell.has_attr('rowspan'):
            rep_row = 1
        if cell.has_attr('colspan'):
            rep_col = int(cell.attrs['colspan'])
        else: # ~cell.has_attr('colspan'):
            rep_col = 1

        return (rep_row, rep_col)

    def process_rows(rows, num_rows, num_cols):
        """
        INPUT:
            1. rows - a list of table rows ie ... elements
        OUTPUT:
            1. data - a Pandas dataframe with the html data in it
        """
        data = pd.DataFrame(np.ones((num_rows, num_cols))*np.nan)
        for i, row in enumerate(rows):
            try:
                col_stat = data.iloc[i,:][data.iloc[i,:].isnull()].index[0]
            except IndexError:
                print(i, row)

            for j, cell in enumerate(row.find_all(['td', 'th'])):
                rep_row, rep_col = get_spans(cell)

                #print("cols {0} to {1} with rep_col={2}".format(col_stat, col_stat+rep_col, rep_col))
                #print("\trows {0} to {1} with rep_row={2}".format(i, i+rep_row, rep_row))

                #find first non-na col and fill that one
                while any(data.iloc[i,col_stat:col_stat+rep_col].notnull()):
                    col_stat+=1

                data.iloc[i:i+rep_row,col_stat:col_stat+rep_col] = cell.getText()
                if col_stat

    Here's an example of how one would use the above code on this Wisconsin data. Suppose it's already in a bs4 soup then...

    ## Find tables on the page and locate the desired one:
    tables = soup.findAll("table", class_='wikitable')
    
    ## I want table 3 or the one that contains years 2000-2018
    table = tables[3]
    
    ## run the above functions to extract the data
    rows, num_rows, num_cols = pre_process_table(table)
    df = process_rows(rows, num_rows, num_cols)
    

    My parser above will accurately parse tables such as the ones here, while all others fail to recreate the tables at numerous points.
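
    The core idea of the parser above, copying each cell into every row and column it spans and skipping slots already claimed by an earlier rowspan, can also be sketched with only the standard library. This is a minimal illustration, not the code from this answer: the names SpanTableParser and expand_spans and the sample table are made up for the example, and it assumes a single non-nested table whose rowspan and colspan values are plain integers.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Hypothetical helper: collect [text, rowspan, colspan] per cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cells per <tr>
        self._cell = None  # cell currently being read, or None

    def _flush(self):
        # Store the cell being read (if any) in the current row.
        if self._cell is not None:
            self._cell[0] = self._cell[0].strip()
            self.rows[-1].append(self._cell)
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._flush()  # tolerate cells without a closing tag
            a = dict(attrs)
            # Assumption: span attributes, when present, are plain integers.
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()


def expand_spans(html):
    """Return a rectangular list of lists with spanned cells repeated."""
    parser = SpanTableParser()
    parser.feed(html)
    parser.close()
    n_rows = len(parser.rows)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    if r + dr < n_rows:  # ignore spans running past the last row
                        grid[(r + dr, c + dc)] = text
            c += colspan
    n_cols = 1 + max((c for _, c in grid), default=-1)
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Made-up sample table: one rowspan and one colspan.
html = """
<table>
  <tr><th>Year</th><th>Party</th><th>Votes</th></tr>
  <tr><td rowspan="2">2000</td><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
  <tr><td>2004</td><td colspan="2">n/a</td></tr>
</table>
"""
grid = expand_spans(html)
for row in grid:
    print(row)
```

    Keying the grid by (row, col) keeps the "find the first free column" step explicit, which is the same job the while loop over col_stat does in process_rows.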

    For simple cases: a simpler solution

    If the table is reasonably well formatted and uses rowspan attributes, there may be a simpler solution. Pandas has a fairly robust read_html function that parses HTML tables and seems to handle rowspan fairly well (although it couldn't parse the Wisconsin tables). fillna(method='ffill') can then fill in the cells that were left empty. Note that this does not necessarily work across column spans, and that some cleanup will be needed afterwards.

    Consider the html code:

    s = """
    <table>
      <tr><td>one</td><td>two</td><td>three</td></tr>
      <tr><td>"4"</td></tr>
      <tr><td>"55"</td><td>"99"</td></tr>
    </table>
    """

    In order to process it into the requested output, just do:

    In [16]: df = pd.read_html(s)[0]
    
    In [29]: df
    Out[29]:
          0     1      2
    0   one   two  three
    1   "4"   NaN    NaN
    2  "55"  "99"    NaN
    

    Then to fill the NAs,

    In [30]: df.fillna(method='ffill')
    Out[30]:
          0     1      2
    0   one   two  three
    1   "4"   two  three
    2  "55"  "99"  three
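
    To make the fill step concrete, here is a small pure-Python sketch of a column-wise forward fill: each missing value is replaced by the most recent value seen above it in the same column. The function name forward_fill is made up for this illustration, and None stands in for NaN; it is not how pandas implements the operation.

```python
def forward_fill(rows):
    """Replace each None with the last non-None value seen above it in that column."""
    last_seen = {}  # column index -> most recent non-missing value
    filled = []
    for row in rows:
        new_row = []
        for col, value in enumerate(row):
            if value is None:
                # Stays None if nothing has been seen in this column yet.
                value = last_seen.get(col)
            else:
                last_seen[col] = value
            new_row.append(value)
        filled.append(new_row)
    return filled


# Same values as the DataFrame above, with None in place of NaN.
table = [
    ["one", "two", "three"],
    ['"4"', None, None],
    ['"55"', '"99"', None],
]
result = forward_fill(table)
for row in result:
    print(row)
```

    In pandas itself, df.ffill() performs the same fill.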
    
