beautifulsoup .get_text() is not specific enough for my HTML parsing

Posted by 狂风中的少年 on 2019-12-23 09:16:25

Question


Given the HTML below, I want to output just the text of the h1 itself, without the "Details about  " prefix, which is the text of the span nested inside the h1.

My current output gives:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

Note: I do not want to simply truncate the string, because I would like this code to be reusable. Ideally, the code would drop any text that sits inside the span.


Answer 1:


You can use extract() to remove all span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every <span> inside the h1 so its text no longer counts toward the heading
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
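
One thing to keep in mind is that extract() modifies the parsed tree, so the span is gone from soup afterwards. If the rest of the program still needs it, a possible variation is to remove the spans from a copy of the tag instead. The sketch below is only an illustration: the helper name text_without_spans is made up for this example, and it assumes a bs4 version (4.4.0 or later) in which copy.copy() on a Tag copies the whole subtree.

```python
import copy

from bs4 import BeautifulSoup


def text_without_spans(tag):
    """Return the text of `tag` with nested <span> elements removed.

    Hypothetical helper: it works on a copy, so the original tree is untouched.
    """
    clone = copy.copy(tag)  # bs4 copies the tag together with its children
    for span in clone.find_all('span'):
        span.extract()
    return clone.get_text().strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

h1 = soup.find('h1', attrs={'itemprop': 'name'})
title = text_without_spans(h1)
print(title)
# The span is still present in the original tree
print(h1.find('span') is not None)
```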



Answer 2:


One solution is to convert each child of the h1 to a string, check whether that string parses as an HTML tag, and skip it if it does:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        if bool(BeautifulSoup(str(content), "html.parser").find()):
            continue

        print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag and skip it if so:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
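
Since the question asks for something reusable, the same check could be wrapped in a small function that joins the direct text children into one string instead of printing them one by one. This is a rough sketch rather than a definitive implementation; the name own_text is an assumption made for this example.

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Join only the strings that are direct children of `tag`.

    Hypothetical helper: text inside nested tags (such as the span) is skipped.
    Note that comment nodes are NavigableString subclasses and would be included.
    """
    pieces = [child for child in tag.contents if isinstance(child, NavigableString)]
    return ''.join(pieces).strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

titles = [own_text(h1) for h1 in soup.find_all('h1', attrs={'itemprop': 'name'})]
for title in titles:
    print(title)
```

Unlike extract(), this leaves the parsed tree as it was, and it skips the text of any nested tag, not only span.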


Source: https://stackoverflow.com/questions/31462360/beautifulsoup-get-text-is-not-specific-enough-for-my-html-parsing
