I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found one.
Using BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html)
print soup.prettify()
gets you
<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>
As far as I know, you can't control whether the <li></li> tags are put on separate lines from Foo.

Using Tidy:
import tidy
html = "<p><ul><li>Foo"
print tidy.parseString(html, show_body_only=True)
gets you
<ul>
<li>Foo</li>
</ul>
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)
comes out as
<p></p>
<ul>
<li>Foo</li>
</ul>
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
print tidy.parseString(html, show_body_only=True, indent=True)
becomes
<ul>
  <li>Foo</li>
</ul>
All of these have their ups and downs, but hopefully one of them is close enough.
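
If neither library is an option, the underlying idea (keep a stack of open tags and close them innermost first) can be sketched with only the standard library. This is a rough illustration, not what BeautifulSoup or Tidy do internally: the names TagCloser and close_tags are made up for this example, the sample input is an assumed unclosed fragment, and the code only balances tags. It does not filter unsafe markup, and it drops comments and declarations.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit a fragment and close any tags left open (hypothetical helper)."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the output markup
        self.stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching open tag: drop it
        # Close everything opened after the matching tag, innermost first.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def finish(self):
        self.close()  # flush any buffered input
        # Whatever is still open gets closed in reverse nesting order.
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def close_tags(fragment):
    parser = TagCloser()
    parser.feed(fragment)
    return parser.finish()


print(close_tags("<p><ul><li>Foo"))  # <p><ul><li>Foo</li></ul></p>
```

Unlike the libraries above, this does no pretty-printing and knows nothing about which elements may legally nest, so it keeps the redundant outer tag rather than dropping or emptying it.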