I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed markup.
Here is an example of cleaning up HTML using the Cleaner class from lxml.html.clean:
import sys

from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    # Strip scripts, styles, frames, forms, comments, etc., keep only a
    # whitelist of attributes, and drop span/font/div wrapper tags
    # (their contents are preserved).
    cleaner = Cleaner(page_structure=True,
                      meta=True,
                      embedded=True,
                      links=True,
                      style=True,
                      processing_instructions=True,
                      inline_style=True,
                      scripts=True,
                      javascript=True,
                      comments=True,
                      frames=True,
                      forms=True,
                      annoying_tags=True,
                      remove_unknown_tags=True,
                      safe_attrs_only=True,
                      safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                      remove_tags=('span', 'font', 'div'))

    return cleaner.clean_html(dirty_html)


if __name__ == '__main__':
    # Read the (possibly broken) HTML file named on the command line and
    # print the cleaned result.
    with open(sys.argv[1]) as fin:
        print(sanitize(fin.read()))
Check out the docs for a full list of options you can pass to the Cleaner.
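Because clean_html() parses the input with lxml's forgiving HTML parser and re-serializes it, structural problems like missing closing tags are generally repaired as a side effect of cleaning. A rough sketch of what that looks like with the sanitize() function above (the input snippet here is made up):

dirty = '<p onclick="alert(1)">Hello <b>world</p>'
print(sanitize(dirty))
# Output is roughly <p>Hello <b>world</b></p>: the onclick attribute is
# dropped because it is not in safe_attrs, and the unclosed <b> tag is
# closed by the parser.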