Filter out HTML tags and resolve entities in Python

暗喜 2020-12-03 00:11

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities in a string in Python.

8 answers
  •  一向 (original poster)
    2020-12-03 00:35

    While I agree with Lucas that regular expressions are not all that scary, I still think you should use a specialized HTML parser. The HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. Python includes one out of the box: the html.parser module in the standard library (named HTMLParser in Python 2).

    You should also check out the Python bindings for TidyLib, which can clean up broken HTML and make any subsequent HTML parsing much more likely to succeed.
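
As a rough illustration of the parser-based approach suggested above, here is a minimal sketch built on the standard library's html.parser module (Python 3). The names TextExtractor and strip_tags are hypothetical helpers made up for this example, not part of any library. The sketch relies on the parser's convert_charrefs option to decode entities, and it makes no attempt to skip the contents of script or style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keep text content, drop every tag."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) makes the
        # parser decode entities such as &amp; or &#233; before the text
        # reaches handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored
        # because handle_starttag/handle_endtag are not overridden.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(markup):
    """Return the text of `markup` with tags removed and entities resolved."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))  # Fish & chips
```

If only the entity part is needed, html.unescape from the same standard library package decodes entities in a plain string without parsing any tags.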
