How to remove content in nested tags with BeautifulSoup?

Posted by 狂风中的少年 on 2020-01-04 09:26:10

Question


How can I remove the content of nested tags with BeautifulSoup? These posts show the reverse, retrieving the content of nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

I have tried .text, but it only removes the tags and keeps the text inside the nested tag:

>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something  blah blah something else'

Desired output:

Something something something else


Answer 1:


You can iterate over the element's children and keep only those that are instances of bs4.element.NavigableString:

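import bs4
from bs4 import BeautifulSoup as bs

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the text nodes that are direct children of elem,
    # skipping nested tags such as <bar> and <bar2>.
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

# print() as a function works on Python 3; naming the parser avoids a warning.
print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))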

Output:

Something something  something  else
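
The output above still contains doubled spaces where the nested tags were. As a minimal sketch of a variation (not part of the original answer), the direct text nodes can also be collected with find_all(string=True, recursive=False) and the whitespace collapsed afterwards. The helper name direct_text is hypothetical, and the built-in html.parser is an assumption.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def direct_text(elem):
    """Return the text that sits directly inside elem, with whitespace collapsed."""
    # recursive=False restricts the search to direct children, so text
    # inside nested tags such as <bar> is never visited.
    pieces = elem.find_all(string=True, recursive=False)
    # Join the pieces, then collapse runs of whitespace into single spaces.
    return " ".join("".join(pieces).split())

soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.find("foo"))
print(result)
```

This leaves the parsed tree untouched, which may matter if the nested tags are still needed later.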



Answer 2:


For example, replace each nested tag with an empty string:

body = bs(html)
for tag in body.find_all('bar'):
    tag.replace_with('')
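
A related option, sketched here under the assumption that the nested tags are not needed afterwards, is Tag.decompose(), which removes a tag and its contents from the tree. The variable names and the html.parser choice are illustrative and not from the original answer.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <bar> tag together with everything inside it.
for tag in soup.find_all("bar"):
    tag.decompose()

foo = soup.find("foo")
# Collapse the doubled space left behind where <bar> used to be.
cleaned = " ".join(foo.get_text().split())
print(cleaned)
```

Unlike filtering the children, this modifies the tree in place, so any later call to .text on the same soup no longer sees the removed content.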



Answer 3:


Here is my simple method: soup.body.clear() or soup.tag.clear().

Let's say you want to clear the content of <table></table> and add a new pandas DataFrame. Later you can use the same clear method to update the tables in an HTML file for your webpage, instead of using flask/django:

    import pandas as pd
    import bs4

I want to convert a 1.2-million-row .csv into a DataFrame, then into an HTML table, and then add it to my webpage's HTML. Later I want to update the data whenever the csv changes, simply by switching a variable:

    dframe = pd.read_csv("business.csv")  # read_csv already returns a DataFrame
    dfhtml = dframe.to_html()  # convert DataFrame to an HTML <table> string
    # Inner markup of the table, without the outer <table> tags. str.strip()
    # removes a set of characters, not a prefix/suffix, so parse instead.
    dfhtml_update = bs4.BeautifulSoup(dfhtml, features="lxml").table.decode_contents()
    # Use dfhtml_update later to update the table contents; the <table>
    # element itself is easy for BS to select & clear!

    #A small function to unescape (&lt; to <) the tags back into HTML format
    def unescape(s):
        s = s.replace("&lt;", "<")
        s = s.replace("&gt;", ">")
        # this has to be last:
        s = s.replace("&amp;", "&")
        return s

    with open("page.html") as page:  #return to here when updating
        txt = page.read()
        soup = bs4.BeautifulSoup(txt, features="lxml")
        soup.body.append(dfhtml) #adds table to <body>
        with open("page.html", "w") as outf:
            outf.write(unescape(str(soup))) #writes to page.html

    """Let's say you want to make seamless table updates to your
    webpage instead of using flask or django; return to the with open block"""
    soup.table.clear()  #clears everything in <table></table>
    soup.table.append(dfhtml_update)
    with open("page.html", "w") as outf:
        outf.write(unescape(str(soup))) 

I'm a newbie, but after tons of searching I combined a bunch of fundamentals from the documentation. It is kind of bloated, but so is working with literally billions of cells of data. This works for me.
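
For comparison, here is a minimal sketch of the same clear-and-refill idea that avoids the escape/unescape round trip by parsing the new rows before appending them. It is not the answerer's method: the page and new_rows strings are hypothetical stand-ins for page.html and the DataFrame output, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the page on disk and the freshly generated rows.
page = "<html><body><table><tr><td>old</td></tr></table></body></html>"
new_rows = "<tr><td>new</td></tr>"

soup = BeautifulSoup(page, "html.parser")
table = soup.table
table.clear()  # removes the table's children but keeps <table> itself

# Parse the new rows into real nodes so they are not escaped as text.
fragment = BeautifulSoup(new_rows, "html.parser")
# Copy the child list first: append() moves each node out of fragment.
for node in list(fragment.contents):
    table.append(node)

updated = str(soup)
print(updated)
```

Because the rows are appended as tags rather than as a string, no manual replacement of &lt; and &gt; is needed when writing the result back to a file.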



Source: https://stackoverflow.com/questions/21757377/how-to-remove-content-in-nested-tags-with-beautifulsoup
