wrap the contents of a tag with BeautifulSoup

旧城冷巷雨未停 提交于 2019-12-11 23:54:03

问题


I'm tring to wrap the contents of a tag with BeautifulSoup. This:

<div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
</div>

should become this:

<div class="footnotes">
  <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
  </ol>
</div>

So I use the following code:

footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol") 
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)

footnotes[0].clear()
footnotes[0].append(new_ol)

print footnotes[0]

but I get the following:

<div class="footnotes"><ol><
    ></
    ><<p>Footnote 1</p>></<p>Footnote 1</p>><
    ></
    ><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>

Suggestions?


回答1:


Using lxml:

import lxml.html as LH
import lxml.builder as builder
E = builder.E

doc = LH.parse('data')
footnote = doc.find('//div[@class="footnotes"]')
ol = E.ol()
for tag in footnote:
    ol.append(tag)
footnote.append(ol)
print(LH.tostring(doc.getroot()))

prints

<html><body><div class="footnotes">
    <ol><p>Footnote 1</p>
    <p>Footnote 2</p>
</ol></div></body></html>

Note that with lxml, an Element (tag) can be in only one place in the tree (since every Element has only one parent), so appending tag to ol also removes it from footnote. So unlike with BeautifulSoup, you do not need to iterate over the contents in reverse order, nor use insert(0,...). You just append in order.


Using BeautifulSoup:

import bs4 as bs
with open('data', 'r') as f:
    soup = bs.BeautifulSoup(f)

footnote = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnote.contents):
    new_ol.insert(0, content.extract())

footnote.append(new_ol)
print(soup)

prints

<html><body><div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>



回答2:


Just move the .contents of your tag over using tag.extract(); don't try to create them anew with soup.new_tag (which only takes a tag name, not a whole tag object). Don't call .clear() on the original tag; .extract() already removed the elements.

Move items over in reverse as the contents are being modified in-place, leading to skipped elements if you don't watch out.

Finally, use .find() when you only need to do this for one tag.

You do need to create a copy of the contents list, as it'll be modified in place

footnotes = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnotes.contents):
    new_ol.insert(0, content.extract())

footnotes.append(new_ol)

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="footnotes">
...     <p>Footnote 1</p>
...     <p>Footnote 2</p>
... </div>
... ''')
>>> footnotes = soup.find("div", { "class" : "footnotes" })
>>> new_ol = soup.new_tag("ol")
>>> for content in reversed(footnotes.contents):
...     new_ol.insert(0, content.extract())
... 
>>> footnotes.append(new_ol)
>>> print footnotes
<div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div>


来源:https://stackoverflow.com/questions/22632355/wrap-the-contents-of-a-tag-with-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!