How do I fix wrongly nested / unclosed HTML tags?

悲&欢浪女 2020-12-01 09:56

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven'

5 Answers
  • 2020-12-01 10:01

    I first tried the approach below, but it fails on Python 3, because the old BeautifulSoup 3 package only supports Python 2:

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(page, 'html5lib')
    

    The following worked, using the bs4 package with html5lib installed:

    import bs4
    soup = bs4.BeautifulSoup(html, 'html5lib')
    f_html = soup.prettify()
    print(f'Formatted html::: {f_html}')
    
  • 2020-12-01 10:02

    Use html5lib as the parser; it works great. Like this:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(data, 'html5lib')
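
    As a rough illustration of how this applies to a user-submitted fragment, here is a minimal sketch. The helper name close_open_tags is made up for this example, and it assumes bs4 and html5lib are installed. html5lib always wraps the input in <html>, <head> and <body>, so the sketch returns only what ended up inside <body>.

```python
from bs4 import BeautifulSoup


def close_open_tags(fragment: str) -> str:
    """Hypothetical helper: return `fragment` with unclosed tags repaired."""
    # html5lib follows the HTML5 parsing rules, so the tree matches what
    # a browser would build from the same markup.
    soup = BeautifulSoup(fragment, "html5lib")
    # Keep only the contents of <body>, dropping the wrapper elements
    # that html5lib adds around every document.
    return soup.body.decode_contents()


print(close_open_tags("<p><ul><li>Foo"))
# → <p></p><ul><li>Foo</li></ul>
```

    Under HTML5 rules a <p> cannot contain a <ul>, so the paragraph is closed as an empty element before the list starts. Note that this only repairs structure; it does not strip scripts or other unsafe markup.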

  • 2020-12-01 10:16

    Using BeautifulSoup:

    from bs4 import BeautifulSoup
    html = "<p><ul><li>Foo"
    # Name the parser explicitly; other parsers may nest the tags differently.
    soup = BeautifulSoup(html, "html.parser")
    print(soup.prettify())
    

    gets you

    <p>
     <ul>
      <li>
       Foo
      </li>
     </ul>
    </p>
    

    As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.

    Using Tidy:

    import tidy
    html = "<p><ul><li>Foo"
    print(tidy.parseString(html, show_body_only=True))
    

    gets you

    <ul>
    <li>Foo</li>
    </ul>
    

    Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

    print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))
    

    comes out as

    <p></p>
    <ul>
    <li>Foo</li>
    </ul>
    

    Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

    Finally, Tidy can also do indenting:

    print(tidy.parseString(html, show_body_only=True, indent=True))
    

    becomes

    <ul>
      <li>Foo
      </li>
    </ul>
    

    All of these have their ups and downs, but hopefully one of them is close enough.

  • 2020-12-01 10:22

    Run it through Tidy or one of its ported libraries.

    Try to code it by hand and you will want to gouge your eyes out.
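
    To give a sense of what hand-coding involves, here is a minimal stack-based sketch that uses only the standard library's html.parser. The names TagCloser, close_tags and VOID are invented for this example. It closes tags in reverse order of opening and drops stray closing tags, but it does not apply any HTML5 nesting rules and it ignores doctypes and processing instructions, so treat it as an illustration rather than a replacement for a real parser.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit markup while tracking currently open tags on a stack."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after `tag`, then `tag` itself.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def close_tags(fragment):
    parser = TagCloser()
    parser.feed(fragment)
    parser.close()
    # Close whatever is still open, innermost first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))
# → <p><ul><li>Foo</li></ul></p>
```

    Even this small version has to decide what to do about void elements, stray closing tags and entities, which is a fair hint of why an existing library is the easier route.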

  • 2020-12-01 10:24

    I just ran into some HTML that lxml and pyquery could not handle well; it seems to contain errors. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:

    from bs4 import BeautifulSoup
    import lxml.html
    soup = BeautifulSoup(page)
    h = lxml.html.fromstring(soup.prettify())
    

    behaves the same as h = lxml.html.fromstring(page).

    What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
    You need to install html5lib first; then you can use it as a parser in BeautifulSoup. The html5lib parser seems to work much better than the others.
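
    Since html5lib is a separate install, one option is to fall back to the built-in parser when it is missing. This is only a sketch: make_soup is a made-up helper name, and it assumes bs4 is installed. bs4 raises FeatureNotFound when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer html5lib, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # html5lib is not installed. html.parser ships with Python but
        # does not apply browser-style repair rules, so the resulting
        # tree may be nested differently.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p><ul><li>Foo")
print(soup.find("li").get_text())
```

    Whether a silent fallback is acceptable depends on how much the exact repaired structure matters; failing loudly may be the better choice when sanitizing user input.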

    Hope this can help someone.
