When using BeautifulSoup with html5lib, it adds the html, head and body tags automatically:
BeautifulSoup('FOO', 'html5lib') # => <html><head></head><body>FOO</body></html>
This aspect of BeautifulSoup has always annoyed the hell out of me.
Here's how I deal with it:
from bs4 import BeautifulSoup

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')
# Do stuff here
# Extract a string repr of the parsed html object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])
A quick breakdown:
# Iterator over all nodes within the <body> tag (your html before parsing)
soup.body.children
# Turn each element into a string object, rather than a bs4 Tag object
# Note: inclusive of html tags
str(x)
# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]
# Join all the string objects together to recreate your original html
"".join()
I still don't like this, but it gets the job done. I run into it whenever I use BS4 to filter certain elements and/or attributes out of an HTML document and then need the whole thing back as a string rather than as a BS4 parsed object.
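For what it's worth, the steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: body_inner_html is a name I made up, bs4 and whichever parser you pass need to be installed, and the fallback to the whole soup is my own addition for parsers that don't add a body tag. Tag.decode_contents() is a bs4 method that renders a tag's children as one string, so it stands in for the join here.

```python
from bs4 import BeautifulSoup


def body_inner_html(markup, parser="lxml"):
    """Parse markup and return it as a string without the <html>/<body> wrapper.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)

    # Do stuff here (filter elements, strip attributes, ...)

    # Some parsers (e.g. html.parser) do not add a <body> to a fragment,
    # so fall back to the soup itself in that case.
    root = soup.body if soup.body is not None else soup

    # Render the children of root as a single string. For element children
    # this should match "".join(str(x) for x in root.children); text and
    # comment nodes are serialized as markup (entities escaped, comment
    # delimiters kept).
    return root.decode_contents()
```

Calling body_inner_html(html, 'html5lib') or body_inner_html(html, 'lxml') should hand back the markup without the wrapper tags those parsers add.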
Hopefully, the next time I Google this, I'll find my answer here.