HTML Table to List Parsing - <TBODY> monkey wrench for both xml and lxml

Submitted by 一世执手 on 2020-01-16 18:44:06

Question


I read the answers to "Parse HTML table to Python list?" and tried to use those ideas to read and process local HTML files downloaded from a web site
(each file contains a single table and starts with the <table class="table"> tag). I ran into problems caused by two HTML tags, <thead> and <tbody>.

With the <thead> tag present, the parse does not pick up the header, and the <tbody> tag causes both xml and lxml to fail completely.
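
A likely cause, sketched here under the assumption that the parser builds the tree exactly as the markup is written: <thead> and <tbody> add one level of nesting, so the direct children of <table> are no longer the <tr> rows. The shortened table below is made up for illustration, and only the standard library is used.

```python
from xml.etree import ElementTree as ET

# Hypothetical, shortened table used only for illustration.
html = (
    '<table class="table">'
    "<thead><tr><th>PU</th><th>CA</th></tr></thead>"
    "<tbody><tr><td>UTG</td><td>BB</td></tr></tbody>"
    "</table>"
)

table = ET.XML(html)

# The direct children of <table> are the wrappers, not the rows.
child_tags = [child.tag for child in table]
print(child_tags)

# Looping one level down from <thead> reaches <tr>, not <th>.
# The <tr> element holds only child elements, so its .text is None.
header_attempt = [col.text for col in table[0]]
print(header_attempt)
```

Seen this way, the row loop in the sample code reads the text of <tr> elements rather than the text of the cells.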

I tried googling for a solution, but the answer is most likely buried somewhere in the documentation for xml and/or lxml.

I'm just trying to plug into xml or lxml in the simplest way possible, but would be happy if the community here pointed the way to other 'stable/trusted' modules that might be more appropriate.

I realize I could edit the strings in Python to remove the tags, but that is not very elegant, and I'm trying to learn new things.

Here is stripped-down sample code illustrating the problem:

#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Parse HTML table to list
#--------*---------*---------*---------*---------*---------*---------*---------*
import os, sys
from xml.etree import ElementTree as ET
from lxml import etree


#                  # this setting blows up

s     = """<table class="table">
<thead>
<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>
</thead>
<tbody>
<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>
</tbody>
</table>
"""

#                  # open this up for clear sailing
if False:
    s     = """<table class="table">

<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>


<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>

</table>
"""

s = s.replace('\n','')
print('0:\n'+s)

# Attempt 1: the standard library's ElementTree
table = ET.XML(s)
for row in table:
    values = [col.text for col in row]
    print('1:')
    print(values)

# Attempt 2: lxml's HTML parser
table = etree.HTML(s).find("body/table")
for row in table:
    values = [col.text for col in row]
    print('2:')
    print(values)

sys.exit()

Answer 1:


While waiting for some help showing how to do this in a 'Pythonic way', I came up with an easy brute-force method:

With the string s set to the version that contains the <thead> and <tbody> tags, apply the following code before parsing:

# Strip the wrapper tags so that every <tr> is a direct child of <table>
for tag in ('<thead>', '</thead>', '<tbody>', '</tbody>'):
    s = s.replace(tag, '')
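
An alternative that leaves the string untouched is to ask the parser for every <tr> at any depth, so the <thead> and <tbody> wrappers no longer matter. The following is a minimal sketch, not a definitive solution: table_to_rows is a hypothetical helper name, and the sample string is copied from the question. It uses Element.iter and Element.itertext from the standard library's ElementTree.

```python
from xml.etree import ElementTree as ET


def table_to_rows(table):
    """Return one list of cell strings per <tr> found at any depth under table."""
    rows = []
    for tr in table.iter("tr"):
        # itertext() gathers all text inside a cell; strip() drops padding
        # and newlines, so empty cells become "" instead of None.
        rows.append(["".join(cell.itertext()).strip() for cell in tr])
    return rows


sample = """<table class="table">
<thead>
<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>
</thead>
<tbody>
<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>
</tbody>
</table>
"""

rows = table_to_rows(ET.XML(sample))
for row in rows:
    print(row)
```

lxml elements offer the same iter and itertext methods, so passing etree.HTML(s).find("body/table") to this helper should behave the same way. Note that ET.XML only accepts well-formed markup; for messier real-world HTML the lxml HTML parser is the more forgiving choice.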


Source: https://stackoverflow.com/questions/49286753/html-table-to-list-parsing-tbody-monkey-wrench-for-both-xml-and-lxml
