Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

Question

I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with Python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to extract from it a thead object, a tbody object, and a tfoot object.

Currently, parse_thead does the following when the <thead> tag is present:

In BeautifulSoup, I get table objects with doc.find_all('table') and I can use table.find_all('thead')
In lxml, I get table objects with doc.xpath() on an xpath_expr on //table, and I can use table.xpath('.//thead')

and parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.

I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".

I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either

Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements. (With either alternative, it seems like I really need to detect the head-body boundary in this way.)

I don't know how to do either of those things and I'd appreciate advice on both which alternative is more sensible and how I might go about it.

(Edit: Examples with no header rows and multiple header rows. I can't assume it has only one header row.)

<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>

Answer 1:

We can use <th> tags to detect header rows when the table doesn't contain a <thead> tag. If every cell in a row is a <th>, we can assume that the row is a header row. Based on that, I wrote a function that separates the header rows from the body rows.

Code for BeautifulSoup:

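def parse_table(table):
    """Split the rows of a table into header rows and body rows."""
    head_body = {'head': [], 'body': []}
    for tr in table.select('tr'):
        # Direct child tags of the row, i.e. its <th>/<td> cells
        cells = tr.find_all(recursive=False)
        # A header row must have cells, and all of them must be <th>
        if cells and all(cell.name == 'th' for cell in cells):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body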

Code for lxml:

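def parse_table(table):
    """Split the rows of a table into header rows and body rows."""
    head_body = {'head': [], 'body': []}
    # cssselect() requires the separate cssselect package to be installed
    for tr in table.cssselect('tr'):
        # Direct child elements of the row (getchildren() is deprecated)
        cells = list(tr)
        # A header row must have cells, and all of them must be <th>
        if cells and all(cell.tag == 'th' for cell in cells):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body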

The table parameter is either a Beautiful Soup Tag object or an lxml Element object. head_body is a dictionary that contains two lists of <tr> tags: the header rows and the body rows.
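For lxml, the same row classification can also be written with XPath alone, which avoids the dependency on the cssselect package. This is only a minimal sketch: the function name parse_table_xpath and the sample HTML are made up for illustration, and it assumes the table contains no nested tables, since .//tr would pick up their rows as well.

```python
import lxml.html

def parse_table_xpath(table):
    """Classify each row of an lxml table element as header or body."""
    head_body = {"head": [], "body": []}
    for tr in table.xpath(".//tr"):
        # Only the cells that are direct children of this row
        cells = tr.xpath("./th | ./td")
        # A header row must have cells, and all of them must be <th>
        if cells and all(cell.tag == "th" for cell in cells):
            head_body["head"].append(tr)
        else:
            head_body["body"].append(tr)
    return head_body

# Hypothetical sample document, not taken from the question
html = "<html><body><table><tr><th>Rank</th></tr><tr><td>1</td></tr></table></body></html>"
doc = lxml.html.fromstring(html)
table = doc.xpath("//table")[0]
rows = parse_table_xpath(table)
print(len(rows["head"]), len(rows["body"]))
```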

Usage example:

from bs4 import BeautifulSoup

html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)

print(table_rows)
#{'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
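The question asks for a slightly stricter rule: collect rows into the header from the top, and stop at the first row that contains anything other than <th>. A row-by-row classification would also count an all-<th> row further down (for example a totals row) as a header. The sketch below shows one way to implement the stricter rule with BeautifulSoup. The helper names is_header_row and split_leading_header and the sample table are illustrative assumptions, and it assumes there are no nested tables, since find_all("tr") searches all descendants.

```python
from bs4 import BeautifulSoup

def is_header_row(tr):
    """True if the row has at least one cell and every cell is a <th>."""
    cells = tr.find_all(["th", "td"], recursive=False)
    return bool(cells) and all(cell.name == "th" for cell in cells)

def split_leading_header(table):
    """Return (head, body), where head is the run of all-<th> rows at the top."""
    rows = table.find_all("tr")
    head = []
    for tr in rows:
        if not is_header_row(tr):
            break  # first non-header row ends the header
        head.append(tr)
    return head, rows[len(head):]

# Hypothetical sample: two header rows, one data row, one all-<th> totals row
html = """
<table>
<tr><th>Group</th><th>Group</th></tr>
<tr><th>Rank</th><th>Score</th></tr>
<tr><td>1</td><td>437</td></tr>
<tr><th>Total</th><th>437</th></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
head, body = split_leading_header(table)
print(len(head), len(body))  # 2 2
```

This handles tables with no header rows (head is empty) and tables with several header rows, which are the two cases raised in the edit to the question.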

Answer 2:

Check whether each tr tag contains a th element. candidate.th returns None if there is no th anywhere inside candidate. Note that this also matches rows that mix th and td cells:

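# All rows of the first table in the document
possible_headers = soup.find("table").find_all("tr")

headers = []
for candidate in possible_headers:
    # candidate.th is the first <th> inside the row, or None
    if candidate.th:
        headers.append(candidate)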
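The first option raised in the question, inserting <thead> and <tbody> into the table so that existing code that looks for <thead> keeps working, could be sketched as follows with BeautifulSoup. This is an illustration under stated assumptions, not a drop-in replacement for parse_thead: the function name add_thead_tbody is hypothetical, and it assumes the rows are direct children of <table>, which is the case when parsing with html.parser (parsers such as html5lib add a <tbody> themselves).

```python
from bs4 import BeautifulSoup

def add_thead_tbody(soup, table):
    """Wrap leading all-<th> rows in <thead> and the rest in <tbody>, in place."""
    # Assumption: a table that already has a section element is left unchanged
    if table.find(["thead", "tbody"], recursive=False) is not None:
        return table
    thead = soup.new_tag("thead")
    tbody = soup.new_tag("tbody")
    in_head = True
    for tr in table.find_all("tr", recursive=False):
        cells = tr.find_all(["th", "td"], recursive=False)
        # The header ends at the first row that is empty or has a non-<th> cell
        if in_head and not (cells and all(c.name == "th" for c in cells)):
            in_head = False
        # extract() detaches the row; append() re-attaches it under the section
        (thead if in_head else tbody).append(tr.extract())
    if thead.contents:
        table.append(thead)
    table.append(tbody)
    return table

# Hypothetical sample table without <thead>/<tbody>
html = "<table><tr><th>Rank</th><th>Score</th></tr><tr><td>1</td><td>437</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = add_thead_tbody(soup, soup.find("table"))
print(table)
```

After this step, table.find_all('thead') and table.find_all('tbody') return the wrapped rows, so code written for tables that do have <thead> should not need to change.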

Source: https://stackoverflow.com/questions/45292001/detecting-header-in-html-tables-using-beautifulsoup-lxml-when-table-lacks-thea

Tags

python

beautifulsoup

lxml