Fast and effective way to parse broken HTML?
问题 I'm working on large projects which require fast HTML parsing, including recovery for broken HTML pages. Currently lxml is my choice, I know it provides an interface for libxml2's recovery mode, too, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup works out really better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, this one has a broken <header> tag which lxml/libxml2 couldn't correct). However, the