Beautiful soup remove superscripts

Submitted by 强颜欢笑 on 2019-12-24 07:57:56

Question


How do I remove the superscripts from all of the text? I have code below that gets all visible text, but the superscripts for footnoting are messing things up. How do I remove them?

For example, in "Active accounts (1),(2)", the (1),(2) are visible superscripts.

from bs4 import BeautifulSoup
from bs4.element import Comment
import requests


f_url='https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/exhibit991prq12018pypl.htm'

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = requests.get(f_url)
text = text_from_html(html.text)

Answer 1:


In the question's code, soup.findAll(text=True) returns a list of every text node in the document, not the tags themselves (find_all is the proper name in BeautifulSoup 4 and is preferred over findAll). The next function, filter, goes through this list and keeps only the items for which its callback returns True. The callback, tag_visible, tests the name of the tag that directly contains each text node and returns False if that name is in the not-wanted list, True otherwise.
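To make that concrete, here is a minimal sketch of what the text search returns and how filter acts on it. The sample markup and the names sample, nodes, pairs and kept are made up for illustration; string=True is the spelling newer BeautifulSoup 4 releases use for text=True.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, not taken from the SEC filing in the question.
sample = '<p>Hi <b>there</b></p>'
soup = BeautifulSoup(sample, 'html.parser')

# string=True selects text nodes, not tags.
nodes = soup.find_all(string=True)

# Each text node knows the tag that directly contains it.
pairs = [(node.parent.name, str(node)) for node in nodes]
print(pairs)

# filter keeps the items for which the callback returns True.
kept = [str(node) for node in filter(lambda n: n.parent.name != 'b', nodes)]
print(kept)
```

The first print shows one (parent name, text) pair per text node; the second shows that only the node whose parent is not b survives the filter.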

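If these superscripts are always marked up with the proper HTML tag sup, you can add 'sup' to the not-wanted list in the callback function. Note that the callback only looks at the direct parent of each text node, so text wrapped in another tag inside the sup (a link, for example) would still get through.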
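A minimal sketch of that change, assuming the footnote markers really are wrapped in sup tags. It reuses the question's two functions; the constant name HIDDEN_PARENTS and the sample string are made up here, and find_all(string=True) stands in for the question's findAll(text=True).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# The question's not-wanted list, with 'sup' added as suggested above.
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]', 'sup'}


def tag_visible(element):
    # Drop text whose direct parent is one of the unwanted tags.
    if element.parent.name in HIDDEN_PARENTS:
        return False
    # Drop HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


# Made-up sample, modelled on the "Active accounts (1),(2)" example.
sample = '<p>Active accounts<sup>(1),(2)</sup> grew.</p>'
print(text_from_html(sample))  # Active accounts grew.
```

Whether the real filing uses sup for its footnote markers has not been checked here; if it does not, the pitfalls listed next apply.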

Possible pitfalls are:

  1. It is assumed that the literal (semantically correct) tag sup is used, and not, for example, a class or a span that merely specifies vertical-align: super; in its CSS;
  2. It is assumed that you want to get rid of all elements that are in this superscript tag. If there are exceptions ("the 20th century", with a raised "th"), you can check the text contents; for example, only remove the element if its contents are all numerical. If there are exceptions to that ("a² = b² + c²"), you will have to check for a wider context, or build a whitelist or blacklist of inclusions/exclusions.
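Both pitfalls can be handled to some degree by removing elements from the tree before extracting the text, instead of filtering text nodes. The following is only a sketch under assumptions: the pattern FOOTNOTE_MARK, the helper names and the sample strings are all made up, only inline style attributes are inspected (not classes or external stylesheets), and a numeric exponent such as the 2 in a² would still be removed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern for markers such as "1", "(1)" or "(1),(2)".
FOOTNOTE_MARK = re.compile(r'^\s*\(?\d+\)?(\s*,\s*\(?\d+\)?)*\s*$')
# Inline style that raises text without using a <sup> tag.
SUPER_STYLE = re.compile(r'vertical-align\s*:\s*super', re.I)


def is_superscript(tag):
    # True for <sup> and for any tag with an inline superscript style.
    if tag.name == 'sup':
        return True
    return bool(SUPER_STYLE.search(tag.get('style', '')))


def strip_footnote_marks(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(is_superscript):
        # Only remove superscripts that look like footnote numbers,
        # so that "20<sup>th</sup>" keeps its "th".
        if FOOTNOTE_MARK.match(tag.get_text()):
            tag.extract()
    return soup.get_text()


print(strip_footnote_marks(
    '<p>Active accounts<sup>(1),(2)</sup> in the 20<sup>th</sup> century</p>'
))  # Active accounts in the 20th century
```

Removing the whole element also covers markers nested inside another tag within the superscript, which the parent-name check in tag_visible would miss.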


Source: https://stackoverflow.com/questions/51115590/beautiful-soup-remove-superscripts
