BeautifulSoup: don't add html, head and body tags automatically

青春惊慌失措 2020-12-03 09:40

When I use BeautifulSoup with html5lib, it adds the html, head and body tags automatically:

BeautifulSoup('FOO', 'html5lib')  # => <html><head></head><body>FOO</body></html>

8 answers
  •  一个人的身影
    2020-12-03 10:17

    This aspect of BeautifulSoup has always annoyed the hell out of me.

    Here's how I deal with it:

    from bs4 import BeautifulSoup

    # Parse the initial HTML-formatted string (html holds your markup)
    soup = BeautifulSoup(html, 'lxml')

    # Do stuff here

    # Extract a string repr of the parsed HTML object, without the <html> or <body> tags
    html = "".join([str(x) for x in soup.body.children])
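
    To try the approach end to end, here is a minimal self-contained sketch. The function name `inner_html` and its `parser` parameter are hypothetical, chosen for illustration. It defaults to `html.parser`, which does not add the wrapper tags, and assumes that falling back to the soup itself is acceptable when no body tag exists; with `lxml` or `html5lib` the body branch is the one that runs.

```python
from bs4 import BeautifulSoup


def inner_html(markup, parser="html.parser"):
    """Parse markup and return it as a string without html/head/body wrappers.

    Hypothetical helper: the name and the default parser are assumptions.
    """
    soup = BeautifulSoup(markup, parser)
    # lxml and html5lib wrap fragments in <html> and <body>; html.parser does
    # not, so fall back to the soup itself when there is no <body>.
    root = soup.body if soup.body is not None else soup
    # Serialize every direct child (tags and text nodes) and glue them together.
    return "".join(str(child) for child in root.children)


fragment = '<p class="a">FOO</p><p>BAR</p>'
result = inner_html(fragment)
print(result)
```

    bs4 tags also have a `decode_contents()` method that renders a tag's contents as a string, so `soup.body.decode_contents()` may be a shorter way to get the same result.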
    

    A quick breakdown:

    # Iterator over all nodes directly inside the <body> tag (your html before parsing)
    soup.body.children
    
    # Turn each element into a string object, rather than a BS4.Tag object
    # Note: inclusive of html tags
    str(x)
    
    # Get a List of all html nodes as string objects
    [str(x) for x in soup.body.children]
    
    # Join all the string objects together to recreate your original html
    "".join([str(x) for x in soup.body.children])
    

    I still don't like this, but it gets the job done. I always run into this when I use BS4 to filter certain elements and/or attributes out of HTML documents and then need the result back as a string rather than as a BS4 parsed object.
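
    The filter-then-serialize use case described here could look roughly like the sketch below. The helper `strip_and_serialize`, its parameters, and the choice of dropping `script` tags and `style` attributes are all assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def strip_and_serialize(markup, drop_tags=("script",), drop_attrs=("style",)):
    """Remove the given tags and attributes, then return the markup as a string.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Remove unwanted elements entirely, including their contents.
    for tag in soup.find_all(list(drop_tags)):
        tag.decompose()
    # Drop unwanted attributes from every remaining tag.
    for tag in soup.find_all(True):
        for attr in drop_attrs:
            tag.attrs.pop(attr, None)
    # Same trick as in the answer: serialize the children, not the wrapper.
    root = soup.body if soup.body is not None else soup
    return "".join(str(child) for child in root.children)


dirty = '<div style="color:red"><script>alert(1)</script><p>FOO</p></div>'
cleaned = strip_and_serialize(dirty)
print(cleaned)
```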

    Hopefully, the next time I Google this, I'll find my answer here.
