Which library should I use, and how, to scrape text from an HTML document by its heading and paragraph tags?

久未见 submitted on 2020-01-06 06:52:11

Question


My input can be any web document, with no fixed HTML structure. I want to extract the text of each heading (headings may be nested) and of the paragraph tags that follow it (there may be several), and output them as pairs.

A simple HTML example can be:

<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>

For this example, I would like an output of pairs such as:

Rule 2.2, D

Rule 2.1, C

Rule 2, D

Rule 2, C

House rule, D

House rule, C

Rule 1, A B

.....and so on.

I am a beginner in Python. I know that web scraping is widely done with Scrapy and BeautifulSoup, and that this case might require XPath or code that identifies sibling tags, since pairing a heading with the paragraphs below it obviously depends on the relative order of the tags. I am not sure which library is better suited here, and it would be really helpful if you could show me how to achieve it. Thanks!


Answer 1:


You can do this with BeautifulSoup: walk the heading levels in increasing order (h1, h2, h3, ...) and, for each heading, collect the <p> tags that come after it in the document:

from bs4 import BeautifulSoup

html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
            <p>D</p>'''

soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package

counter = 1
all_leafs = []
while True:
    # Visit heading levels in order: h1, h2, h3, ...
    htag = 'h%d' % counter
    hgroups = soup.find_all(htag)
    print(htag, len(hgroups))
    counter += 1
    # Stop at the first heading level that does not occur in the document
    if len(hgroups) == 0:
        break
    for hgroup in hgroups:
        # Pair this heading with every <p> that appears after it in the document
        for descendant in hgroup.find_all_next():
            name = getattr(descendant, "name", None)
            if name == 'p':
                all_leafs.append((hgroup.get_text(), descendant.get_text()))
print(all_leafs)

...

h1 1
h2 2
h3 2
h4 0
[('House rule', 'A'), ('House rule', 'B'), ('House rule', 'C'), ('House rule', 'D'), ('Rule 1', 'A'), ('Rule 1', 'B'), ('Rule 1', 'C'), ('Rule 1', 'D'), ('Rule 2', 'C'), ('Rule 2', 'D'), ('Rule 2.1', 'C'), ('Rule 2.1', 'D'), ('Rule 2.2', 'D')]
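
Note that the output above pairs each heading with every later paragraph, so 'Rule 1' also picks up C and D. If a heading should only be paired with the paragraphs in its own section, one possible variation is sketched below. It is not part of the original answer: the function name heading_paragraph_pairs is made up for illustration, and it assumes a section ends at the next heading of the same or a higher level. It also uses the built-in "html.parser" so that lxml is not needed.

```python
import re
from bs4 import BeautifulSoup

# Matches the tag names h1 .. h6
HEADING = re.compile(r"^h[1-6]$")

def heading_paragraph_pairs(html):
    """Pair each heading with the <p> tags inside its own section.

    Assumption: a heading's section ends at the next heading of the
    same or a higher level (an <h2> ends at the next <h2> or <h1>).
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all(HEADING):
        level = int(heading.name[1])
        # find_all_next() yields the tags that follow in document order
        for node in heading.find_all_next():
            if HEADING.match(node.name):
                if int(node.name[1]) <= level:
                    break  # a sibling or parent section starts here
            elif node.name == "p":
                pairs.append((heading.get_text(strip=True),
                              node.get_text(strip=True)))
    return pairs

sample_html = """
<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>
"""

pairs = heading_paragraph_pairs(sample_html)
for heading_text, paragraph_text in pairs:
    print(heading_text, paragraph_text, sep=", ")
```

With the example from the question, 'Rule 1' is then paired only with A and B, while 'House rule' still covers all four paragraphs because every other heading is nested below it. Because the walk relies on document order rather than on sibling relationships, it should also cope with headings and paragraphs wrapped in extra container tags.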


Source: https://stackoverflow.com/questions/51571609/by-what-library-and-how-can-i-scrape-texts-on-an-html-by-its-heading-and-paragra
