Question
I have to parse some nasty government created html (http://www.spokanecounty.org/detentionservices/inmateroster/detail2.aspx?sysid=84060) and to ease my pain I would like to insert some html fragments into the document to wrap some content into more easily digested chunks.
BS4, however, escapes the html string fragment I'm trying to insert (<div class="case">) and turns it into this:
&lt;div class="case"&gt;
The relevant html I'm parsing is this:
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
</div>
<div style='width:45%; float:left;'>
<h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
<div>Added: 10/22/2012</div>
</div>
<div style='width:100%;clear:both;'>
<b>Case Bond:</b> $2,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121018261' style='width:100%;'>
<tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table>
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
</div>
<div style='width:45%; float:left;'>
<h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
<div>Added: 10/21/2012</div>
</div>
<div style='width:100%;clear:both;'>
<b>Case Bond:</b> $150,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121037010' style='width:100%;'>
<tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table>
The Python code looks like this:
case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for c in case_top:
    c.insert_before(soup.new_string('<div class="case">'))
case_bottom = soup.find_all("table", class_="bookinfo")
for c in case_bottom:
    c.insert_after(soup.new_string('</div>'))
The results look like this:
&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/22/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> $2,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121018261" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr><tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr></table>&lt;/div&gt;&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/21/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> $150,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121037010" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE)<br/><b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr></table>&lt;/div&gt;
The question is then, how can I insert an unescaped html fragment into the document?
Answer 1:
You are telling BeautifulSoup to insert string data:
c.insert_before(soup.new_string('<div class="case">'))
Anything not safe for HTML string data will then indeed be escaped. You instead want to insert a tag object:
c.insert_before(soup.new_tag('div', **{'class': 'case'}))
This inserts a new, empty element as a sibling just before c; it does not actually wrap anything.
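To make the difference concrete, here is a minimal sketch on a made-up one-paragraph document (the markup and the variable names are assumptions for illustration, not part of the page in the question). It inserts the same text once as a string node and once as a tag node, then compares the serialized output.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; "html.parser" is chosen only so no extra parser is needed.
soup = BeautifulSoup("<body><p>target</p></body>", "html.parser")
p = soup.p

# A string node: markup characters are escaped when the tree is serialized.
p.insert_before(soup.new_string('<div class="case">'))
as_string = str(soup)

# A tag node: serialized as real markup, but it is an empty sibling of <p>.
p.insert_before(soup.new_tag("div", **{"class": "case"}))
as_tag = str(soup)

print(as_string)
print(as_tag)
```

The first print shows the fragment as `&lt;div class="case"&gt;`; the second additionally contains a real but empty `<div class="case"></div>` sitting in front of the paragraph.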
If you wanted to wrap each individual element, you could use the element's wrap() method:
c.wrap(soup.new_tag('div', **{'class': 'case'}))
but this only works on one tag at a time.
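As a small illustration of that limitation, the sketch below wraps a single heading in a made-up two-element document (the sample markup is an assumption). Only the element that wrap() is called on ends up inside the new tag; its sibling table stays outside.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a heading followed by a table.
soup = BeautifulSoup("<body><h2>Case</h2><table></table></body>", "html.parser")

# wrap() replaces the element with the new tag, puts the element inside it,
# and returns the wrapper.
wrapper = soup.h2.wrap(soup.new_tag("div", **{"class": "case"}))

print(soup)
```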
To wrap a series of tags, you have to move them into the wrapper yourself; appending a tag that already lives elsewhere in the tree moves it rather than copying it:
case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for case in case_top:
    wrapper = soup.new_tag('div', **{'class': 'case'})
    case.insert_before(wrapper)
    while wrapper.next_sibling:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            # moved over the bookinfo table, time to stop
            break
This then moves everything from the case_top element all the way to the <table class="bookinfo"> element into the new <div class="case"> element.
Demo:
>>> from bs4 import BeautifulSoup
>>> import re
>>> sample = '''\
... <body>
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...
... </div>
... <div style='width:45%; float:left;'>
... <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
... <div>Added: 10/22/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
... <b>Case Bond:</b> $2,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121018261' style='width:100%;'>
... <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... </table>
...
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...
... </div>
... <div style='width:45%; float:left;'>
... <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
... <div>Added: 10/21/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
... <b>Case Bond:</b> $150,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121037010' style='width:100%;'>
... <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
... </table>
... </body>
... '''
>>> soup = BeautifulSoup(sample)
>>> case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
>>> for case in case_top:
...     wrapper = soup.new_tag('div', **{'class': 'case'})
...     case.insert_before(wrapper)
...     while wrapper.next_sibling:
...         wrapper.append(wrapper.next_sibling)
...         if wrapper.find('table', class_='bookinfo'):
...             # moved over the bookinfo table, time to stop
...             break
...
>>> soup.body
<body><div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
</div>
<div style="width:45%; float:left;">
<h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/22/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> $2,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121018261" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br/> <b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423<b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table></div>
<div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
</div>
<div style="width:45%; float:left;">
<h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/21/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> $150,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121037010" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br/> <b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table></div>
</body>
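For use in a script rather than at the prompt, the same loop could be packaged roughly as follows. This is only a sketch: the function name wrap_cases, the shortened sample markup, and the choice of "html.parser" are assumptions, and the stop condition checks the node that was just moved instead of searching the whole wrapper each time.

```python
import re
from bs4 import BeautifulSoup, Tag

def wrap_cases(soup):
    """Move each run of siblings, from a separator <div> through the next
    bookinfo <table>, into a new <div class="case"> (hypothetical helper)."""
    separators = soup.find_all(style=re.compile("border-top:solid 1px #666"))
    for sep in separators:
        wrapper = soup.new_tag("div", **{"class": "case"})
        sep.insert_before(wrapper)
        while wrapper.next_sibling is not None:
            node = wrapper.next_sibling
            wrapper.append(node)  # appending re-parents the node, i.e. moves it
            if (isinstance(node, Tag) and node.name == "table"
                    and "bookinfo" in node.get("class", [])):
                break  # the bookinfo table closes the case
    return soup

# Shortened stand-in for the roster markup, two cases back to back.
html = (
    "<body>"
    "<div style='border-top:solid 1px #666;'></div><h2>A</h2>"
    "<table class='bookinfo 1'></table>"
    "<div style='border-top:solid 1px #666;'></div><h2>B</h2>"
    "<table class='bookinfo 2'></table>"
    "</body>"
)
soup = wrap_cases(BeautifulSoup(html, "html.parser"))
cases = soup.find_all("div", class_="case")
print(len(cases))
```

Each resulting wrapper can then be handled as one unit, for example by reading its h2 and its bookinfo table together.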
Source: https://stackoverflow.com/questions/28776780/how-to-insert-unescaped-html-fragment-in-beautiful-soup-4