Question
How do I remove the superscripts from all of the text? I have code below that gets all visible text, but the superscripts for footnoting are messing things up. How do I remove them?
For example, in "Active accounts (1),(2)", the (1),(2) are visible superscripts.
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

f_url = 'https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/exhibit991prq12018pypl.htm'

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = requests.get(f_url)
text = text_from_html(html.text)
Answer 1:
The BeautifulSoup function find_all returns a list of every matching node in the input; called with text=True, as here, those nodes are the individual text strings (find_all is the proper function to use in BeautifulSoup 4 and is preferred over findAll). The next function, filter, goes through this list and removes the items for which its callback returns False. The callback tests the name of the tag that encloses each string and returns False if it is in the not-wanted list (or if the string is a comment), True otherwise.
If these superscripts are always marked up with the proper HTML tag, sup, then you can add it to the not-wanted list in the callback function.
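As a minimal sketch of that suggestion, the question's code with 'sup' added to the list might look like the following. The constant name UNWANTED_PARENTS and the inline sample string are assumptions made for illustration (the sample replaces the network request), and string=True is the newer spelling of text=True in BeautifulSoup 4. Note that this only checks the immediate parent, so text nested deeper inside a sup element would not be caught.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text should be dropped; 'sup' is the only addition
# compared with the list in the question.
UNWANTED_PARENTS = ['style', 'script', 'head', 'title', 'meta', '[document]', 'sup']

def tag_visible(element):
    # Drop strings that sit directly inside an unwanted tag.
    if element.parent.name in UNWANTED_PARENTS:
        return False
    # Drop HTML comments.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

# Hypothetical sample standing in for the downloaded page.
sample = "<html><body><p>Active accounts<sup>(1),(2)</sup> grew</p></body></html>"
print(text_from_html(sample))
```
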
Possible pitfalls are:
- It is assumed that the literal (semantically correct) tag sup is used and not, for example, a class or a span that merely specifies vertical-align: super; in its CSS.
- It is assumed that you want to get rid of all elements that are in this superscript tag. If there are exceptions ("the 20th century"), you can check the text contents; for example, only remove the element if its contents are all numerical. If there are exceptions to that ("a² = b² + c²"), you will have to check a wider context, or build a whitelist or blacklist of inclusions/exclusions.
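Both pitfalls can be partly handled in the callback. The sketch below is one possible approach, not a definitive one: the helper names (is_superscript, visible_strings) and both regular expressions are assumptions to adapt. It treats a parent as a superscript if it is a sup tag or carries an inline vertical-align: super style (super is the CSS keyword), and removes the text only when it consists of digits, parentheses, commas and whitespace. Superscripts applied through a class in a stylesheet are not detected, and a purely numeric exponent such as the 2 in a² would still be removed.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed pattern for footnote markers: "1", "(1)", "(1),(2)", "1, 2", ...
FOOTNOTE_RE = re.compile(r'^[\s(),]*\d[\d\s(),]*$')
# Assumed pattern for an inline style that raises text to superscript.
SUPER_STYLE_RE = re.compile(r'vertical-align\s*:\s*super', re.IGNORECASE)

def is_superscript(tag):
    # True for <sup> and for any tag styled inline as a superscript.
    if tag.name == 'sup':
        return True
    return bool(SUPER_STYLE_RE.search(tag.get('style') or ''))

def tag_visible(element):
    if isinstance(element, Comment):
        return False
    parent = element.parent
    if parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Only drop superscripts whose text looks like a footnote marker.
    if is_superscript(parent) and FOOTNOTE_RE.match(element):
        return False
    return True

def visible_strings(body):
    soup = BeautifulSoup(body, 'html.parser')
    return [t.strip() for t in soup.find_all(string=True) if tag_visible(t)]

# Hypothetical sample covering a <sup> marker, a non-numeric <sup>,
# and a styled <span>.
sample = ('<p>Active accounts<sup>(1),(2)</sup> in the 20<sup>th</sup> century; '
          'net revenue<span style="vertical-align: super">(3)</span></p>')
print(visible_strings(sample))
```

The "th" of "20th" survives because it contains no digit, while the numeric markers in both the sup tag and the styled span are dropped.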
Source: https://stackoverflow.com/questions/51115590/beautiful-soup-remove-superscripts