Question
My input can be any web document, so there is no fixed HTML structure. I want to extract the text of each heading (headings may be nested) together with the paragraph tags that follow it (there may be several), and output them as pairs.
A simple HTML example:
<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>
For this example, I would like to get the following output pairs:
Rule 2.2, D
Rule 2.1, C
Rule 2, D
Rule 2, C
House rule, D
House rule, C
Rule 1, A B
.....and so on.
I am a Python beginner. I know that web scraping is commonly done with Scrapy and BeautifulSoup, and I suspect this case needs XPath or some code that identifies sibling tags, since pairing a heading with the paragraphs below it depends on the relative order of the tags. I am not sure which library is the better fit here, and it would be really helpful if you could show me how to achieve this. Thanks!
Answer 1:
With BeautifulSoup you can walk the heading levels in increasing order (h1, h2, h3, ...) and, for each heading, collect every <p> tag that follows it in the document:
html = '''
<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
counter = 1
all_leafs = []
while True:
    # Look at one heading level per pass: h1, then h2, then h3, ...
    htag = 'h%d' % counter
    hgroups = soup.find_all(htag)
    print(htag, len(hgroups))
    counter += 1
    # Stop at the first heading level that does not occur in the document
    if len(hgroups) == 0:
        break
    for hgroup in hgroups:
        # Pair the heading with every <p> that comes after it in the document
        for p in hgroup.find_all_next('p'):
            all_leafs.append((hgroup.get_text(), p.get_text()))
print(all_leafs)
...
h1 1
h2 2
h3 2
h4 0
[('House rule', 'A'), ('House rule', 'B'), ('House rule', 'C'), ('House rule', 'D'), ('Rule 1', 'A'), ('Rule 1', 'B'), ('Rule 1', 'C'), ('Rule 1', 'D'), ('Rule 2', 'C'), ('Rule 2', 'D'), ('Rule 2.1', 'C'), ('Rule 2.1', 'D'), ('Rule 2.2', 'D')]
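Note that the output above also pairs 'Rule 1' with C and D, because every later <p> is collected regardless of the headings in between. If each paragraph should only be paired with the headings it is actually nested under, one possible variation is to walk headings and paragraphs in document order and keep a stack of the headings that are still "open". The following is only a minimal sketch under that assumption; the function name heading_paragraph_pairs and the variable sample_html are made up for illustration, and it uses the built-in "html.parser" backend instead of "lxml".

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def heading_paragraph_pairs(html):
    """Pair each <p> with every heading it is nested under (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    open_headings = []  # stack of (level, text) for headings still in scope
    pairs = []
    # find_all with a list of names returns matching tags in document order
    for tag in soup.find_all(HEADING_TAGS + ["p"]):
        if tag.name == "p":
            text = tag.get_text(strip=True)
            pairs.extend((title, text) for _, title in open_headings)
        else:
            level = int(tag.name[1])
            # A new heading closes every open heading of the same or deeper level
            while open_headings and open_headings[-1][0] >= level:
                open_headings.pop()
            open_headings.append((level, tag.get_text(strip=True)))
    return pairs


sample_html = """
<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>
"""

pairs = heading_paragraph_pairs(sample_html)
print(pairs)
```

With the sample document, 'Rule 1' is paired only with A and B, 'Rule 2.1' only with C, and 'House rule' with all four paragraphs. Joining the paragraphs of one heading into a single string (as in "Rule 1, A B") could be done afterwards by grouping the pairs by heading.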
Source: https://stackoverflow.com/questions/51571609/by-what-library-and-how-can-i-scrape-texts-on-an-html-by-its-heading-and-paragra