selfClosingTags in BeautifulSoup

Happy的楠姐 2021-01-13 15:57

Using BeautifulSoup to parse my XML

import BeautifulSoup

soup = BeautifulSoup.BeautifulStoneSoup( """hello


        
2 Answers
  • 2021-01-13 16:29

    You are asking what was in the mind of an author, after having noted that he gives names like Beautiful[Stone]Soup to classes/modules :-)

    Here are two more examples of the behaviour of BeautifulStoneSoup:

    >>> soup = BeautifulSoup.BeautifulStoneSoup(
        """<alan x="y" ><anne>hello</anne>"""
        )
    >>> print soup.prettify()
    <alan x="y">
     <anne>
      hello
     </anne>
    </alan>
    
    >>> soup = BeautifulSoup.BeautifulStoneSoup(
        """<alan x="y" ><anne>hello</anne>""",
        selfClosingTags=['alan'])
    >>> print soup.prettify()
    <alan x="y" />
    <anne>
     hello
    </anne>
    >>>
    

    My take: a self-closing tag is not legal if it is not defined to the parser. So the author had choices when deciding how to handle an illegal fragment like <alan x="y" />: (1) assume that the / was a mistake; (2) treat alan as a self-closing tag, quite independently of how it might be used elsewhere in the input; or (3) make two passes over the input, working out in the first pass how each tag was used. Which choice do you prefer?
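
    To make choice (3) concrete, here is a rough sketch of what a first pass could look like. It is not how BeautifulSoup works internally; it assumes Python 3 and uses the standard library's html.parser instead, and the names SelfClosingCollector and find_self_closing_tags are made up for illustration.

```python
from html.parser import HTMLParser


class SelfClosingCollector(HTMLParser):
    """First pass: record every tag name that is written as <tag ... />."""

    def __init__(self):
        super().__init__()
        self.self_closing = set()

    def handle_startendtag(self, tag, attrs):
        # html.parser calls this hook only for the <tag ... /> form;
        # a plain <tag ...> goes to handle_starttag instead.
        self.self_closing.add(tag)


def find_self_closing_tags(markup):
    """Return the sorted names of all tags used in self-closing form."""
    collector = SelfClosingCollector()
    collector.feed(markup)
    collector.close()
    return sorted(collector.self_closing)


tags = find_self_closing_tags('<alan x="y" /><anne>hello</anne>')
print(tags)
```

    The resulting list could then presumably be handed to the second pass, for example as the selfClosingTags argument shown in the session above.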

  • 2021-01-13 16:34

    I don't have a "why", but this might be of interest to you. If you use BeautifulSoup (no Stone) to parse XML with a self-closing tag, it works. Sort of:

    >>> soup = BeautifulSoup.BeautifulSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])
    >>> print soup.prettify()
    <alan x="y">
    </alan>
    <anne>
     hello
    </anne>
    

    The nesting is right, even if alan is rendered as a start and an end tag.
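
    For comparison, a strict XML parser treats the two renderings as the same empty element. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree rather than BeautifulSoup, and it wraps the fragment in a hypothetical <root> element because a strict parser needs a single root.

```python
import xml.etree.ElementTree as ET

# <root> is added only for illustration; it is not part of the original input.
markup = '<root><alan x="y" /><anne>hello</anne></root>'
root = ET.fromstring(markup)

# alan and anne come out as siblings, so the nesting matches the output above.
child_tags = [child.tag for child in root]
print(child_tags)

alan = root.find("alan")
# The same empty element can be serialized in either form.
short_form = ET.tostring(alan, encoding="unicode")
long_form = ET.tostring(alan, encoding="unicode", short_empty_elements=False)
print(short_form)
print(long_form)
```

    Either serialization parses back to an alan element with no children, which is why the start-plus-end rendering is harmless here.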
