Access to a specific table in html tag

*爱你&永不变心* posted on 2019-12-06 16:28:15

Question


I want to use BeautifulSoup to find the table defined in the “Content Logical Definition” section of the following pages:

1) https://www.hl7.org/fhir/valueset-account-status.html
2) https://www.hl7.org/fhir/valueset-activity-reason.html
3) https://www.hl7.org/fhir/valueset-age-units.html 

Each page may contain several tables. The one I want is located under the <h2> tag whose text is “Content Logical Definition”. Some pages have no table in the “Content Logical Definition” section, and in that case I want the result to be null. So far I have tried several solutions, but each of them returns the wrong table for some of the pages.

The most recent solution, offered by alecxe, is this:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)

This solution returns null when the “Content Logical Definition” section contains no table. For the second URL, however, which does have a table in that section, it returns the wrong table: one at the end of the page.
How can I change this code so that it returns the table defined directly after the tag with the text “Content Logical Definition”, and returns null if that section has no table?


Answer 1:


It looks like the problem with alecxe's code is that it only finds a table that is a direct sibling of the h2, but the one you want is actually inside a div (which is the h2's sibling). This worked for me:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-account-status.html',
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]


def extract_table(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == 'h2' and 'Content Logical Definition' in elm.text)
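    if h2 is None:
        return None  # the page has no such heading
    div = h2.find_next_sibling('div')
    # find() itself returns None when the div holds no table
    return div.find('table') if div is not None else None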


for url in urls:
    print(extract_table(url))
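
The two approaches in this thread can probably be combined: walk the siblings that follow the heading, stop at the next h2, and accept a table whether it is a direct sibling or nested inside one (for example in a div). The following is only a sketch under assumptions. The name table_in_section and the two inline HTML samples are made up for illustration and are not taken from the HL7 pages, and the function assumes the section's content follows the h2 as siblings. It uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup


def table_in_section(html, heading_text="Content Logical Definition"):
    """Return the first table between the matching h2 and the next h2, else None.

    Hypothetical helper: assumes the section's content follows the h2 as siblings.
    """
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None  # heading not present on this page

    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            return None  # reached the next section without seeing a table
        if sibling.name == "table":
            return sibling  # table is a direct sibling of the heading
        nested = sibling.find("table")
        if nested is not None:
            return nested  # table is wrapped in a div or similar container
    return None


# Made-up sample markup, only to exercise both outcomes.
SAMPLE_WITH_TABLE = """
<h2>Content Logical Definition</h2>
<div><p>text</p><table id="wanted"><tr><td>a</td></tr></table></div>
<h2>Expansion</h2>
<table id="other"><tr><td>b</td></tr></table>
"""

SAMPLE_WITHOUT_TABLE = """
<h2>Content Logical Definition</h2>
<div><p>no table here</p></div>
<h2>Expansion</h2>
<table id="other"><tr><td>b</td></tr></table>
"""

if __name__ == "__main__":
    print(table_in_section(SAMPLE_WITH_TABLE))
    print(table_in_section(SAMPLE_WITHOUT_TABLE))
```

To use it on the real pages, pass requests.get(url).content as the html argument. Whether the real markup matches the sibling assumption would need to be checked page by page.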


Source: https://stackoverflow.com/questions/37552550/access-to-a-specific-table-in-html-tag
