Wrap the contents of a tag with BeautifulSoup

Question

I'm trying to wrap the contents of a tag with BeautifulSoup. This:

<div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
</div>

should become this:

<div class="footnotes">
  <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
  </ol>
</div>

So I use the following code:

footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol") 
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)

footnotes[0].clear()
footnotes[0].append(new_ol)

print(footnotes[0])

but I get the following:

<div class="footnotes"><ol><
    ></
    ><<p>Footnote 1</p>></<p>Footnote 1</p>><
    ></
    ><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>

Suggestions?

Answer 1:

Using lxml:

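import lxml.html as LH
import lxml.builder as builder
E = builder.E

doc = LH.parse('data')
# Use a relative path: an absolute '//div' passed to find() on a tree raises a FutureWarning
footnote = doc.find('.//div[@class="footnotes"]')
ol = E.ol()
# Iterate over a snapshot of the children, since append() moves each one out of footnote
for tag in list(footnote):
    ol.append(tag)
footnote.append(ol)
# encoding='unicode' returns str rather than bytes on Python 3
print(LH.tostring(doc.getroot(), encoding='unicode'))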

prints

<html><body><div class="footnotes">
    <ol><p>Footnote 1</p>
    <p>Footnote 2</p>
</ol></div></body></html>

Note that with lxml, an Element (tag) can be in only one place in the tree (since every Element has only one parent), so appending tag to ol also removes it from footnote. So unlike with BeautifulSoup, you do not need to iterate over the contents in reverse order, nor use insert(0,...). You just append in order.
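The move-on-append behaviour described above can be checked with a small sketch. The inline HTML string and the variable names are made up for illustration, and etree.Element is used here instead of the builder only to keep the example short:

```python
import lxml.html as LH
from lxml import etree

# A single-element fragment, so fromstring() returns the <div> itself
div = LH.fromstring(
    '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>'
)
ol = etree.Element('ol')

first = div[0]
children_before = len(div)

# Appending an element that already has a parent moves it; it is not copied
ol.append(first)

children_after = len(div)
new_parent = first.getparent()
```

After the append, the div has one child fewer and the paragraph's parent is the new ol element, which is why no separate removal step is needed.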

Using BeautifulSoup:

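import bs4 as bs
with open('data', 'r') as f:
    # Name the parser explicitly; 'lxml' adds the <html><body> wrapper seen in the output below
    soup = bs.BeautifulSoup(f, 'lxml')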

footnote = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnote.contents):
    new_ol.insert(0, content.extract())

footnote.append(new_ol)
print(soup)

prints

<html><body><div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>

Answer 2:

Just move the .contents of your tag over using tag.extract(); don't try to create them anew with soup.new_tag (which only takes a tag name, not a whole tag object). Don't call .clear() on the original tag; .extract() already removed the elements.

Move items over in reverse as the contents are being modified in-place, leading to skipped elements if you don't watch out.
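The skipping is easy to reproduce. The following is a minimal sketch, assuming bs4 is installed and using the built-in 'html.parser'; the three-paragraph HTML string and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<div><p>1</p><p>2</p><p>3</p></div>'

# Forward iteration over the live .contents list: each extract() shifts the
# remaining children left, so the loop steps over every other child.
soup = BeautifulSoup(html, 'html.parser')
skipped_ol = soup.new_tag('ol')
for child in soup.div.contents:
    skipped_ol.append(child.extract())
skipped = [p.get_text() for p in skipped_ol.find_all('p')]

# Iterating over a snapshot of the list visits every child exactly once.
soup = BeautifulSoup(html, 'html.parser')
full_ol = soup.new_tag('ol')
for child in list(soup.div.contents):
    full_ol.append(child.extract())
moved = [p.get_text() for p in full_ol.find_all('p')]

print(skipped)  # ['1', '3']
print(moved)    # ['1', '2', '3']
```

In the first loop the second paragraph is never visited and stays behind in the div.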

Finally, use .find() when you only need to do this for one tag.

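Iterating with reversed() avoids the problem here; if you would rather iterate forwards, loop over a copy of the contents list (for example list(footnotes.contents)), because the original list is modified in place: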

footnotes = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnotes.contents):
    new_ol.insert(0, content.extract())

footnotes.append(new_ol)

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="footnotes">
...     <p>Footnote 1</p>
...     <p>Footnote 2</p>
... </div>
... ''', 'html.parser')
>>> footnotes = soup.find("div", { "class" : "footnotes" })
>>> new_ol = soup.new_tag("ol")
>>> for content in reversed(footnotes.contents):
...     new_ol.insert(0, content.extract())
... 
>>> footnotes.append(new_ol)
>>> print(footnotes)
<div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div>
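If this is needed in more than one place, the same steps could be folded into a small helper. This is only a sketch: wrap_contents is a hypothetical name, not part of BeautifulSoup, and it iterates over a copy of the contents list rather than in reverse:

```python
from bs4 import BeautifulSoup


def wrap_contents(soup, tag, wrapper_name):
    """Move every child of `tag` into a new <wrapper_name> element.

    Hypothetical helper, not a bs4 API. Returns the new wrapper tag.
    """
    wrapper = soup.new_tag(wrapper_name)
    # Copy the list first: extract() mutates tag.contents in place
    for child in list(tag.contents):
        wrapper.append(child.extract())
    tag.append(wrapper)
    return wrapper


soup = BeautifulSoup(
    '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>',
    'html.parser',
)
footnotes = soup.find('div', {'class': 'footnotes'})
wrap_contents(soup, footnotes, 'ol')
result = str(footnotes)
print(result)
```

Whitespace-only text nodes between the paragraphs, if present, are moved along with the tags, so the original formatting is preserved inside the wrapper.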

Source: https://stackoverflow.com/questions/22632355/wrap-the-contents-of-a-tag-with-beautifulsoup

Tags

python

beautifulsoup

lxml