Question
I want to find all tables in an HTML document using BeautifulSoup. Inner tables should be included in their outer tables rather than listed separately.
I have written some code that works and gives the expected output. However, I don't like this solution because it destroys the 'soup' object.
Do you know a more elegant way to do this?
from BeautifulSoup import BeautifulSoup as bs
input = '''<html><head><title>title</title></head>
<body>
<p>paragraph</p>
<div><div>
<table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2<table>inner2</table></table></div>
</div></div>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
</html>'''
soup = bs(input)
while True:
    t = soup.find("table")
    if t is None:
        break
    print str(t)
    t.decompose()
Output:
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
Answer 1:
Use soup.findAll("table") instead of find() and decompose():
tables = soup.findAll("table")
for table in tables:
    if table.findParent("table") is None:
        print str(table)
Output:
<table>table1<table>inner11<table>inner12</table></table></table>
<table>table2<table>inner2</table></table>
<table>table3<table>inner3</table></table>
<table>table4<table>inner4</table></table>
Nothing gets destroyed: the soup object is only read, never modified.
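For readers on the newer bs4 package, the same parent-based filter might look like the sketch below. It is not part of the original answer: it assumes Python 3 with bs4 installed, uses the bs4 method names find_all() and find_parent(), a shortened version of the sample HTML, and a hypothetical helper name, top_level_tables. The "html.parser" backend is also an assumption; other parsers may restructure nested tables differently.

```python
from bs4 import BeautifulSoup

# Shortened sample input (assumption: same nesting idea as in the question).
html = '''<html><body>
<div><table>table1<table>inner11<table>inner12</table></table></table>
<div><table>table2<table>inner2</table></table></div></div>
<table>table3<table>inner3</table></table>
</body></html>'''


def top_level_tables(soup):
    """Hypothetical helper: return tables that have no <table> ancestor."""
    return [t for t in soup.find_all("table")
            if t.find_parent("table") is None]


soup = BeautifulSoup(html, "html.parser")
tables = top_level_tables(soup)

# Each outer table is printed together with the tables nested inside it.
for table in tables:
    print(table)
```

Because the tree is only queried, a later soup.find_all("table") still sees every table, nested ones included.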
Source: https://stackoverflow.com/questions/9783579/find-all-tables-in-html-using-beautifulsoup