BeautifulSoup: don't add html, head and body tags automatically

青春惊慌失措 2020-12-03 09:40

When BeautifulSoup is used with html5lib, it adds the html, head and body tags automatically:

BeautifulSoup('FOO', 'html5lib') # => <html><head></head><body>FOO</body></html>
8 answers
  •  广开言路
    2020-12-03 10:23

    Let's first create a soup sample:

    from bs4 import BeautifulSoup

    # html5lib wraps the fragment in html, head and body
    soup = BeautifulSoup("content", "html5lib")

    You can then reach the children of html and body through soup.body:

    # python3: get body's first child
    print(next(soup.body.children))
    
    # if first child's tag is rss
    print(soup.body.rss)
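
Building on the body access above, here is a minimal sketch of how the inner markup could be serialized without the wrapper tags. The sample markup is hypothetical, and the wrappers are written out explicitly so the example does not depend on which parser is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the explicit wrappers stand in for what html5lib adds.
markup = "<html><head></head><body><p>content</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# decode_contents() renders only the children of <body>, not <body> itself.
inner_html = soup.body.decode_contents()
print(inner_html)
```

This leaves the soup untouched, which may be preferable when the tree is still needed afterwards.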
    

    You can also use unwrap() to remove body, head, and html:

    soup.html.body.unwrap()
    if soup.html.head is not None:  # some parsers (e.g. lxml) add no head
        soup.html.head.unwrap()
    soup.html.unwrap()
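
The unwrap() steps can be folded into a small helper that skips any wrapper the parser did not create. This is only a sketch: strip_wrappers is a hypothetical name, not part of bs4, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap <body>, <head> and <html> in place, skipping any that are absent."""
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            # unwrap() removes the tag but keeps its children in the tree.
            tag.unwrap()
    return soup


# Hypothetical sample with the wrappers written out explicitly.
markup = "<html><head></head><body><p>content</p></body></html>"
soup = strip_wrappers(BeautifulSoup(markup, "html.parser"))
print(soup)
```

Checking each tag before unwrapping avoids an AttributeError when a parser leaves one of them out.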
    

    If you are loading an XML file, diagnose(data) from the bs4.diagnose module shows how each installed parser handles it, including lxml-xml, which does not wrap the soup in html and body tags:

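    >>> from bs4 import BeautifulSoup as BS
    >>> BS('<foo>xxx</foo>', 'lxml-xml')
    <?xml version="1.0" encoding="utf-8"?>
    <foo>xxx</foo>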
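
For HTML fragments, another parser that leaves the input unwrapped is Python's built-in html.parser. The sketch below uses a hypothetical fragment to show that no html, head or body tags appear in the result.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, chosen only for illustration.
fragment = "<p>content</p>"

# html.parser does not try to build a complete document around the fragment.
soup = BeautifulSoup(fragment, "html.parser")
print(soup)       # the fragment, without wrapper tags
print(soup.html)  # None, because no <html> tag was created
```

The trade-off is that html.parser is less lenient than html5lib with badly broken markup, so the result may differ on malformed input.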
    
