Clean up ugly WYSIWYG HTML code? Python or *nix utility

Posted by 梦想与她 on 2019-12-04 05:04:04

You could also take a look at Bleach, a whitelist-based HTML sanitizer. It uses html5lib to do what Kyle posted, but it gives you much more control over which elements and attributes are allowed in the final output.
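
To make the whitelist idea concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not how Bleach is implemented and it is not a security tool (it does not check URL schemes, for instance); the `ALLOWED` table, the `AllowListCleaner` class and the `clean` helper are hypothetical names made up for this illustration. Tags on the list are kept with only their listed attributes, and any other tag is dropped while its text is kept.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list for illustration: tag name -> attributes to keep.
ALLOWED = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href"},
}


class AllowListCleaner(HTMLParser):
    """Keep allow-listed tags and attributes; drop other tags but keep their text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text still reaches handle_data
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Re-escape text so stray '<' or '&' cannot turn back into markup.
        self.parts.append(escape(data, quote=False))


def clean(html_text):
    cleaner = AllowListCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.parts)


messy = ('<p class="MsoNormal"><span style="color:red">Hello</span> '
         '<a href="/x" onclick="evil()">link</a></p>')
print(clean(messy))  # <p>Hello <a href="/x">link</a></p>
```

For real user input you would still want Bleach or a similarly maintained sanitizer; the sketch only shows why an explicit list of tags and attributes gives you tight control over the output.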

Beautiful Soup will probably get you a more complete solution, but you might be able to get some cleanup done more simply with html5lib (if you're OK with html5 rules):

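import html5lib
from html5lib import treebuilders, treewalkers, serializer
# Only needed for the sanitizing variant; this module exists only in older
# html5lib releases, so remove this line if the import fails.
from html5lib import sanitizer

my_html = "<i>Some html fragment</I>"  # intentional uppercase 'I' in the closing tag

# Parse the fragment into a DOM tree using HTML5 parsing rules
html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
dom_tree = html_parser.parseFragment(my_html)

# Walk the tree and serialize it back to normalized HTML
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
s = serializer.HTMLSerializer(omit_optional_tags=False, quote_attr_values="always")
cleaned_html = s.render(stream)
assert cleaned_html == "<i>Some html fragment</i>"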

You can also sanitize the HTML by initializing your html_parser like this (this relies on the older html5lib API, where the sanitizer is passed in as a tokenizer):

html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"), tokenizer=sanitizer.HTMLSanitizer)

S.Lott

The standard answer is Beautiful Soup.

"Extra span" and "garbage tags" is something you'll need to define very, very carefully so you can remove the tags without removing content.

I would suggest you do two things.

  1. Fix your app so that users don't provide HTML under any circumstances. Django can use RST (reStructuredText) markup, which is much more user-friendly. http://docs.djangoproject.com/en/1.3/ref/templates/builtins/#django-contrib-markup

  2. Write a Beautiful Soup parser and transform the user's content into RST markup. Keep the structural elements (headings, lists, etc.) and lose the formatting to the extent possible.
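
As a rough illustration of step 2, here is a sketch of an HTML-to-RST pass. The answer suggests Beautiful Soup; to keep the example dependency-free it uses the standard library's `html.parser` instead, and the `HtmlToRst` class, the `to_rst` helper and the underline characters are assumptions made for this example. It keeps headings, paragraphs and flat list items, and drops every other tag (span, font, b, ...) while keeping the text inside it. Nested lists and text outside these block elements are not handled.

```python
from html.parser import HTMLParser

# RST underline characters per heading level (a common convention, not a rule).
UNDERLINES = {"h1": "=", "h2": "-", "h3": "~"}
BLOCK_TAGS = set(UNDERLINES) | {"p", "li"}


class HtmlToRst(HTMLParser):
    """Keep headings, paragraphs and list items; ignore all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished RST blocks
        self.text = []       # text collected for the current block
        self.list_kind = []  # stack of "ul"/"ol" for the enclosing lists

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_kind.append(tag)
        elif tag in BLOCK_TAGS:
            self.text = []   # start collecting text for a new block

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            if self.list_kind:
                self.list_kind.pop()
            return
        if tag not in BLOCK_TAGS:
            return  # formatting tag: dropped, but its text was already collected
        content = " ".join("".join(self.text).split())  # collapse whitespace
        self.text = []
        if not content:
            return
        if tag in UNDERLINES:
            self.blocks.append(content + "\n" + UNDERLINES[tag] * len(content))
        elif tag == "li":
            ordered = bool(self.list_kind) and self.list_kind[-1] == "ol"
            self.blocks.append(("#. " if ordered else "* ") + content)
        else:
            self.blocks.append(content)

    def handle_data(self, data):
        self.text.append(data)


def to_rst(html_text):
    converter = HtmlToRst()
    converter.feed(html_text)
    converter.close()
    return "\n\n".join(converter.blocks)


sample = ('<h1><span style="font-size:20px">Title</span></h1>'
          '<p>Some <b>bold</b> text.</p>'
          '<ul><li>one</li><li><font color="red">two</font></li></ul>')
print(to_rst(sample))
```

The same structure carries over to Beautiful Soup: walk the tree, emit RST for the structural elements you decide to keep, and take only the text from everything else.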
