How to parse malformed HTML in python, using standard libraries

不知归路 2020-12-08 04:36

There are so many html and xml libraries built into python, that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great t

6 Answers

北海茫月 (original poster)

2020-12-08 05:01
As already stated, there is currently no satisfying solution using only the standard library. I faced the same problem when I tried to run one of my programs on an outdated hosting environment that offered only Python 2.6 and no way to install my own extensions. Solution:

Grab this file and the latest stable BeautifulSoup version of the 3.x series (3.2.1 as of now). From the tar file, pick only BeautifulSoup.py; it's the only one you really need to ship with your code. With these two files on your path, all you need to do to get a regular etree object from an HTML string, as you would get from lxml, is this:
```
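# Python 2 code (the answer targets Python 2.6).
# ElementSoup.py and BeautifulSoup.py (3.x series) must both be on the import path.
from StringIO import StringIO
import ElementSoup

# input_str holds the (possibly malformed) HTML you want to parse.
tree = ElementSoup.parse(StringIO(input_str))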
```
lxml requires compiled C code in order to run, and html5lib, although pure Python, is one more third-party package to install. Getting them working takes considerably more effort, so if your environment is restricted, or your intended audience is not willing to do that, avoid them.
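To show roughly how far the standard library alone gets you, here is a minimal sketch, assuming Python 3 (so it does not apply as written to the Python 2.6 setup described above). It feeds tag soup through `html.parser.HTMLParser` and builds an `xml.etree.ElementTree` tree with `TreeBuilder`. The names `LenientTreeParser`, `parse_html`, and the synthetic `document` root are made up for this illustration. It only balances tags; it does not apply HTML's implied-end-tag rules the way BeautifulSoup does, which is part of why it is not a satisfying replacement.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientTreeParser(HTMLParser):
    """Hypothetical helper: builds an ElementTree from tag-soup HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # names of the tags that are currently open
        self._builder.start("document", {})  # synthetic root element

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        self._builder.start(tag, attrib)
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close void elements immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag: ignore it
        # Close any unclosed children, then the tag itself.
        while self._open:
            top = self._open.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        self.close()  # flush any text still buffered by HTMLParser
        while self._open:  # close whatever was left open
            self._builder.end(self._open.pop())
        self._builder.end("document")
        return self._builder.close()


def parse_html(text):
    """Parse a (possibly malformed) HTML string into an Element."""
    parser = LenientTreeParser()
    parser.feed(text)
    return parser.finish()
```

For example, `parse_html("<div><p>Hello <b>world</p><br>bye</i></div>")` returns a root element you can query with the usual etree calls such as `root.find("div/p/b")`. The unclosed `<b>` is closed together with its parent `<p>`, and the stray `</i>` is dropped.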