How to parse malformed HTML in Python, using standard libraries

不知归路 2020-12-08 04:36

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great t

6 answers
  •  不思量自难忘°
    2020-12-08 04:46

    I cannot think of any popular language with a good, robust, heuristic HTML parsing library in its stdlib. Python certainly does not have one, which I think you already know.

    Why the requirement of a stdlib module? Most of the time when I hear people make that requirement, they are being silly. For most major tasks, you will either need a third-party module or have to spend a whole lot of work re-implementing one. Introducing a dependency is a good thing, since that's work you didn't have to do.

    So what you want is lxml.html. Ship lxml with your code if that's an issue, at which point it becomes functionally equivalent to writing it yourself except in difficulty, bugginess, and maintainability.
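
To make the lxml.html suggestion concrete, here is a minimal sketch of pulling links out of broken markup. The helper name `extract_links`, the `_LinkCollector` class, and the sample string are made up for illustration. The fallback to the stdlib `html.parser` is an assumption added so the snippet still runs where lxml is not installed; it only tokenizes tags and does not repair the document tree the way lxml does.

```python
from html.parser import HTMLParser

try:
    import lxml.html  # third-party; assumed installed or shipped with your code
except ImportError:
    lxml = None  # assumption: fall back to the stdlib tokenizer below


class _LinkCollector(HTMLParser):
    """Stdlib fallback (hypothetical helper): collects href values, builds no tree."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def extract_links(markup):
    """Return the href values of <a> tags found in possibly malformed HTML."""
    if lxml is not None:
        # lxml.html.fromstring tolerates unclosed tags and unquoted attributes
        doc = lxml.html.fromstring(markup)
        return [str(href) for href in doc.xpath("//a/@href")]
    collector = _LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Unclosed <p>, <a>, and <li>, an unquoted attribute, and no closing </html>
broken = "<html><body><p>Unclosed paragraph<a href='/one'>first<li><a href=/two>second</body>"
print(extract_links(broken))
```

With lxml available you also get a full element tree (`doc.xpath`, `doc.cssselect`, `doc.text_content()`), which the stdlib fallback cannot offer; that gap is the practical argument for taking the dependency.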
