How to insert unescaped html fragment in Beautiful Soup 4

Submitted by 我只是一个虾纸丫 on 2020-05-12 04:57:28

Question


I have to parse some nasty government-created HTML (http://www.spokanecounty.org/detentionservices/inmateroster/detail2.aspx?sysid=84060), and to ease my pain I would like to insert some HTML fragments into the document to wrap the content in more easily digested chunks.

BS4, however, escapes the html string fragment I'm trying to insert (<div class="case">) and turns it into this:

&lt;div class="case"&gt;

The relevant html I'm parsing is this:

<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
    &nbsp;
</div>
<div style='width:45%; float:left;'>
    <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
    <div>Added: 10/22/2012</div>
</div>
<div style='width:100%;clear:both;'>
    <b>Case Bond:</b> $2,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121018261' style='width:100%;'>
    <tr><td><b>Charge 1  <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
    <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table>

<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
    &nbsp;
</div>
<div style='width:45%; float:left;'>
    <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
    <div>Added: 10/21/2012</div>
</div>
<div style='width:100%;clear:both;'>
    <b>Case Bond:</b> $150,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121037010' style='width:100%;'>
    <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table>

The Python code looks like this:

case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for c in case_top:
    c.insert_before(soup.new_string('<div class="case">'))
case_bottom = soup.find_all("table", class_="bookinfo")
for c in case_bottom:
    c.insert_after(soup.new_string('</div>'))

The results look like this:

&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/22/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> $2,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121018261" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr><tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr></table>&lt;/div&gt;&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/21/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> $150,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121037010" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE)<br/><b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr></table>&lt;/div&gt;

The question is then, how can I insert an unescaped html fragment into the document?


Answer 1:


You are telling BeautifulSoup to insert string data:

c.insert_before(soup.new_string('<div class="case">'))

Anything not safe for HTML string data will then indeed be escaped. You instead want to insert a tag object:

c.insert_before(soup.new_tag('div', **{'class': 'case'}))

This inserts a new, empty sibling element just before c; it does not actually wrap anything.
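To make the difference concrete, here is a minimal sketch (the one-paragraph document and the "parsed" class are made up for illustration, and html.parser is an assumed parser choice). It inserts the same markup once as a string and once as a tag, and also shows that markup you already have as text can be parsed into a tag first. Note that a tag is always a complete element; the tree has no way to hold a lone opening or closing tag.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the real page is far larger.
soup = BeautifulSoup('<p id="target">hello</p>', 'html.parser')
target = soup.p

# A string is text data, so its angle brackets are escaped on output.
target.insert_before(soup.new_string('<div class="case">'))

# A tag is markup, so it is written out as a real (empty) element.
target.insert_after(soup.new_tag('div', **{'class': 'case'}))

# Existing markup can be parsed into a tag and inserted the same way.
fragment = BeautifulSoup('<div class="parsed"><b>x</b></div>', 'html.parser')
target.insert_after(fragment.div)

result = str(soup)
print(result)
```

The string comes out as escaped text, while both tags come out as elements.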

If you wanted to wrap each individual element in that area, you'd use the element's wrap() method:

c.wrap(soup.new_tag('div', **{'class': 'case'}))

but this only works on one tag at a time.
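A small sketch of what wrap() does to a single element, using a made-up two-element document and html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<table class="bookinfo"></table><p>after</p>', 'html.parser')
table = soup.find('table', class_='bookinfo')

# wrap() replaces the table with the new tag, puts the table inside it,
# and returns the wrapper.
wrapper = table.wrap(soup.new_tag('div', **{'class': 'case'}))

wrapped = str(soup)
print(wrapped)
```

Only the table ends up inside the new div; the following paragraph stays where it was.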

To wrap a series of tags, you have to move them into the wrapper yourself. Inserting a tag that already lives elsewhere in the tree removes it from its old position, so appending it to the wrapper effectively moves it:

case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for case in case_top:
    wrapper = soup.new_tag('div', **{'class': 'case'})
    case.insert_before(wrapper)
    while wrapper.next_sibling:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            # moved over the bookinfo table, time to stop
            break

This then moves everything from the case_top element all the way to the <table class="bookinfo"> element into the new <div class="case"> element.
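If you need this in more than one place, the loop could be pulled into a function. The following is only a sketch: wrap_run and is_last are hypothetical names, not part of Beautiful Soup, and unlike the loop above it tests only the node that was just moved rather than searching the whole wrapper each time.

```python
from bs4 import BeautifulSoup, Tag

def wrap_run(soup, start, is_last):
    """Hypothetical helper: move `start` and its following siblings into a
    new <div class="case">, stopping after the first tag that is_last accepts."""
    wrapper = soup.new_tag('div', **{'class': 'case'})
    start.insert_before(wrapper)
    while wrapper.next_sibling is not None:
        node = wrapper.next_sibling
        wrapper.append(node)  # appending moves the node out of its old place
        if isinstance(node, Tag) and is_last(node):
            break
    return wrapper

# Made-up miniature document, parsed with html.parser (an assumption).
soup = BeautifulSoup(
    '<body><hr/><p>a</p><table class="bookinfo"></table><p>outside</p></body>',
    'html.parser')

def is_bookinfo(tag):
    return tag.name == 'table' and 'bookinfo' in tag.get('class', [])

wrapper = wrap_run(soup, soup.hr, is_bookinfo)
print(soup.body)
```

If no sibling ever satisfies is_last, the loop simply runs to the end of the parent, the same as the original loop would.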

Demo:

>>> from bs4 import BeautifulSoup
>>> import re
>>> sample = '''\
... <body>
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...     &nbsp;
... </div>
... <div style='width:45%; float:left;'>
...     <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
...     <div>Added: 10/22/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
...     <b>Case Bond:</b> $2,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121018261' style='width:100%;'>
...     <tr><td><b>Charge 1  <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
...     <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... </table>
... 
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...     &nbsp;
... </div>
... <div style='width:45%; float:left;'>
...     <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
...     <div>Added: 10/21/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
...     <b>Case Bond:</b> $150,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121037010' style='width:100%;'>
...     <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
... </table>
... </body>
... '''
>>> soup = BeautifulSoup(sample)
>>> case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
>>> for case in case_top:
...     wrapper = soup.new_tag('div', **{'class': 'case'})
...     case.insert_before(wrapper)
...     while wrapper.next_sibling:
...         wrapper.append(wrapper.next_sibling)
...         if wrapper.find('table', class_='bookinfo'):
...             # moved over the bookinfo table, time to stop
...             break
... 
>>> soup.body
<body><div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
     
</div>
<div style="width:45%; float:left;">
<h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/22/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> $2,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121018261" style="width:100%;">
<tr><td><b>Charge 1  <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br/> <b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423<b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table></div>
<div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
     
</div>
<div style="width:45%; float:left;">
<h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/21/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> $150,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121037010" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br/> <b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table></div>
</body>
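
Once each case sits in its own div, pulling the data out per case gets much simpler. Here is a rough sketch on a cut-down version of the sample (the trimmed HTML, the html.parser choice and the cases list are my own for illustration; the real page has more fields):

```python
import re
from bs4 import BeautifulSoup

# Cut-down stand-in for the page, keeping only what the sketch needs.
html = '''<body>
<div style="border-top:solid 1px #666;">&nbsp;</div>
<div><h2 rel="121018261">Case Number: 121018261</h2></div>
<table class="bookinfo 121018261"><tr><td>ROBBERY-2ND DEG</td></tr><tr><td>ROBBERY-2ND DEG</td></tr></table>
<div style="border-top:solid 1px #666;">&nbsp;</div>
<div><h2 rel="121037010">Case Number: 121037010</h2></div>
<table class="bookinfo 121037010"><tr><td>RAPE-2ND(FORCIBLE)</td></tr></table>
</body>'''

soup = BeautifulSoup(html, 'html.parser')

# Same wrapping loop as in the demo above.
for top in soup.find_all(style=re.compile('border-top:solid 1px #666')):
    wrapper = soup.new_tag('div', **{'class': 'case'})
    top.insert_before(wrapper)
    while wrapper.next_sibling:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            break

# Each wrapper now holds exactly one case, so searches stay local to it.
cases = []
for case in soup.find_all('div', class_='case'):
    number = case.h2.get_text(strip=True).replace('Case Number:', '').strip()
    charges = [td.get_text(strip=True) for td in case.find_all('td')]
    cases.append((number, charges))

print(cases)
```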


Source: https://stackoverflow.com/questions/28776780/how-to-insert-unescaped-html-fragment-in-beautiful-soup-4
