When you parse with BeautifulSoup and html5lib, the parser adds the html, head and body tags automatically:
BeautifulSoup('FOO', 'html5lib')  # => <html><head></head><body>FOO</body></html>
Let's first create a soup sample:
soup = BeautifulSoup("content", "html5lib")
You can reach the children of html and body through attribute access such as soup.body.<tag>:
# python3: get body's first child
print(next(soup.body.children))
# if first child's tag is rss
print(soup.body.rss)
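As a minimal sketch of the two access styles above: the markup here is made up for illustration, and the wrappers are written out explicitly so the example runs with the built-in "html.parser" (html5lib would add the same wrappers on its own).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; <html>/<head>/<body> are spelled out so that
# the standard-library parser yields the same tree shape html5lib would.
markup = "<html><head></head><body><rss><item>content</item></rss></body></html>"
soup = BeautifulSoup(markup, "html.parser")

first_child = next(soup.body.children)  # first node inside <body>
same_child = soup.body.rss              # attribute access by tag name
missing = soup.body.table               # None when no such tag exists

print(first_child.name)
print(first_child is same_child)
print(missing)
```

Attribute access returns None rather than raising when the tag is not there, so check the result before using it.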
You can also use unwrap() to remove body, head and html while keeping their contents:
soup.html.body.unwrap()
# head may be absent, so check before unwrapping it
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
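The same steps can be wrapped in a small helper. This is only a sketch: strip_wrappers is a hypothetical name, not part of bs4, and the sample markup again spells out the wrappers so that "html.parser" is enough to run it.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Remove <body>, <head> and <html> in place, keeping their children."""
    # Innermost wrappers first; each one is skipped if it is not present.
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return soup


# Hypothetical sample markup with the wrappers written out explicitly.
markup = "<html><head></head><body><p>content</p></body></html>"
soup = strip_wrappers(BeautifulSoup(markup, "html.parser"))
print(soup)
```

Because unwrap() keeps children, anything inside head (a title, for example) would stay in the result; call decompose() on head instead if you want it gone.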
If you are loading an XML file, bs4.diagnose.diagnose(data) will point you to the lxml-xml parser, which does not wrap your soup in html and body:
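>>> from bs4 import BeautifulSoup as BS
>>> BS('<root>xxx</root>', 'lxml-xml')
<?xml version="1.0" encoding="utf-8"?>
<root>xxx</root>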