How to parse malformed HTML in Python using standard libraries

不知归路 · 2020-12-08 04:36

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great t…

6 Answers
  •  长情又很酷 · 2020-12-08 04:47

    Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).

    There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.
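    For comparison, here is a rough sketch (not part of the original answer) of how the same hypothetical broken snippet goes through the three parsers mentioned above. None of them ships with Python, so each needs to be installed first (e.g. pip install lxml beautifulsoup4 html5lib).

        broken = "<p>Unclosed paragraph <b>bold <a href='/x'>link</p>"

        # lxml.html: C-based (libxml2), by far the fastest of the three.
        import lxml.html
        doc = lxml.html.fromstring(broken)
        print([a.get("href") for a in doc.iter("a")])   # ['/x']

        # BeautifulSoup: convenient search API; can delegate parsing to lxml or html5lib.
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(broken, "lxml")
        print([a["href"] for a in soup.find_all("a")])  # ['/x']

        # html5lib: pure Python, follows the HTML5 parsing algorithm.
        import html5lib
        tree = html5lib.parse(broken, namespaceHTMLElements=False)  # etree tree by default
        print([a.get("href") for a in tree.iter("a")])  # ['/x']

    All three recover the link from the mangled markup; the differences that matter in practice are speed, ease of installation, and how closely the error recovery follows the HTML5 specification.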
