How to parse malformed HTML in Python, using standard libraries

不知归路 2020-12-08 04:36

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great t

6 answers
  •  北海茫月
    2020-12-08 05:01

    As already stated, there is currently no satisfying solution that uses only the standard library. I faced the same problem when I tried to run one of my programs in an outdated hosting environment that offered only Python 2.6 and no way to install extensions. My solution:

    Grab this file and the latest stable BeautifulSoup release of the 3.x series (3.2.1 as of now). From the tarball, pick only BeautifulSoup.py; it is the only file you really need to ship with your code. With these two files on your path, all you need to do to get an ordinary etree object from an HTML string, like the one you would get from lxml, is this:

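    from StringIO import StringIO  # Python 2 module, matching the Python 2.6 setup above
    import ElementSoup  # expects BeautifulSoup.py to be importable from the same path

    # input_str is the HTML string to parse; parse() reads from a file-like object
    tree = ElementSoup.parse(StringIO(input_str))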

    lxml requires compiled C code to run, and html5lib, although pure Python, is still a third-party package that has to be installed. Getting them working takes considerably more effort, so if your environment is restricted, or your intended audience is not willing to do that, avoid them.
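
    For readers on a current Python 3 rather than the Python 2.6 setup described above, here is a rough standard-library-only sketch. It is not the ElementSoup approach from this answer: it relies on html.parser.HTMLParser, which in Python 3.5+ does not raise on malformed markup, and builds an xml.etree.ElementTree tree by hand. The names VOID_TAGS, LenientTreeBuilder and parse_html, and the synthetic "document" root element, are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientTreeBuilder(HTMLParser):
    """Hypothetical helper: turns tag soup into an ElementTree element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Synthetic root so that fragments with several top-level tags still fit
        self.root = ET.Element("document")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text following a closed child belongs to that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    """Parse an HTML string and return the synthetic root element."""
    builder = LenientTreeBuilder()
    builder.feed(text)
    builder.close()  # flush any buffered trailing text
    return builder.root
```

    Usage would be tree = parse_html(input_str), after which tree.iter("a") and similar ElementTree calls work as usual. Note that this sketch only tolerates unclosed and stray tags; it does not apply the browser rules for implied end tags, so an unclosed <p> followed by another <p> ends up nested rather than as siblings.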
