I'm trying to parse some HTML in Python. There were some methods that actually worked before... but nowadays there's nothing I can actually use without workarounds.
Make sure that you use the html module when you parse HTML with lxml:
>>> from lxml import html
>>> doc = """
...
... Meh
...
...
... Look at this interesting use of <p>
... rather than using <br> tags as line breaks
... """
>>> html.document_fromstring(doc)
All the errors and exceptions will melt away, and you'll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
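To see what you actually get back, here is a minimal sketch of pulling text out of sloppy markup with lxml.html. The helper name paragraph_texts and the sample markup are made up for illustration; only document_fromstring, xpath and text_content come from lxml itself.

```python
from lxml import html


def paragraph_texts(markup):
    """Return the stripped text of every <p> element in possibly broken HTML.

    Hypothetical helper for illustration; not part of lxml.
    """
    # document_fromstring always hands back the root <html> element,
    # adding any missing wrapper elements along the way.
    root = html.document_fromstring(markup)
    # text_content() joins the text of an element and all of its descendants.
    return [p.text_content().strip() for p in root.xpath("//p")]


# Tag soup: neither <p> is closed, and <body> and <html> are never closed either.
soup = "<html><body><p>First line<p>Second line<br>Third line"

texts = paragraph_texts(soup)
print(texts)
```

The parser closes the first paragraph when it meets the second one, so you should end up with two separate p elements rather than one nested inside the other. Note that text_content() simply drops the br, so if you need the line break preserved you have to handle br elements yourself.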