Using BeautifulSoup to grab all the HTML between two tags

问题

I have some HTML that looks like this:

<h1>Title</h1>

//a random amount of p/uls or tagless text

<h1> Next Title</h1>

I want to copy all of the HTML from the first h1, to the next h1. How can I do this?

回答1:

This is the clear BeautifulSoup way, when the second h1 tag is a sibling of the first:

html = u""
for tag in soup.find("h1").next_siblings:
    if tag.name == "h1":
        break
    else:
        html += unicode(tag)

回答2:

I have the same problem. Not sure if there is a better solution, but what I've done is use regular expressions to get the indices of the two nodes that I'm looking for. Once I have that, I extract the HTML between the two indexes and create a new BeautifulSoup object.

Example:

m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')
target_html = html[s:e]
new_bs = BeautifulSoup(target_html)

回答3:

Interesting question. There is no way you can use just DOM to select it. You'll have to loop trough all elements preceding the first h1 (including) and put them into intro = str(intro), then get everything up to the 2nd h1 into chapter1. Than remove the intro from the chapter1 using

chapter = chapter1.replace(intro, '')

回答4:

Here is a complete, up-to-date solution:

Contents of temp.html:

<h1>Title</h1>
<p>hi</p>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>

Code:

import copy

from bs4 import BeautifulSoup

with open("resources/temp.html") as file_in:
    soup = BeautifulSoup(file_in, "lxml")

print(f"Before:\n{soup.prettify()}")

first_header = soup.find("body").find("h1")

siblings_to_add = []

for curr_sibling in first_header.next_siblings:
    if curr_sibling.name == "h1":
        for curr_sibling_to_add in siblings_to_add:
            curr_sibling.insert_after(curr_sibling_to_add)
        break
    else:
        siblings_to_add.append(copy.copy(curr_sibling))

print(f"\nAfter:\n{soup.prettify()}")

Output:

Before:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
 </body>
</html>

After:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
  //a random amount of p/uls or tagless text
  <p>
   hi
  </p>
 </body>
</html>

来源：https://stackoverflow.com/questions/4456679/using-beautifulsoup-to-grab-all-the-html-between-two-tags

标签

python

html

beautifulsoup