I'm completely new to Python and am using Python 3.1 on Windows (pywin). I need to parse some HTML, essentially to extract values between specific HTML tags, and am confused a
I'm currently using lxml, and on Windows I used the installation binary from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml.
import lxml.html
page = lxml.html.fromstring(...)
title = page.xpath('//head/title/text()')[0]
If your HTML is well formed, you have many options, such as sax and dom. If it is not well formed, you need a fault-tolerant parser such as Beautiful Soup, element tidy, or lxml's HTML parser. No parser is perfect when presented with a variety of broken HTML; sometimes I have to try more than one. lxml and ElementTree use a mostly compatible API that is more of a standard than Beautiful Soup's.
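For the well-formed case, here is a minimal sketch of the DOM route using the standard library's xml.dom.minidom. The sample markup and variable names are made up for illustration, and this will raise a parse error on the kind of broken HTML you often find in the wild.

```python
from xml.dom import minidom

# Hypothetical, well-formed (XHTML-style) sample input.
markup = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello</p></body></html>"
)

doc = minidom.parseString(markup)

# getElementsByTagName returns every matching element in document order.
title_node = doc.getElementsByTagName("title")[0]

# The title's text lives in a child text node.
title = title_node.firstChild.data
print(title)
```

If the input is not well formed, parseString raises an exception instead of guessing, which is why a fault-tolerant parser is needed for messy pages.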
In my opinion, lxml is the best module for working with XML documents, but the ElementTree included with Python is still pretty good. In the past I have used Beautiful Soup to convert HTML to XML and construct an ElementTree for processing the data.
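To give an idea of what the bundled ElementTree looks like in use, here is a small sketch that pulls values out of specific tags. The markup is an invented, already well-formed sample; it stands in for whatever cleaned-up document you end up with.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input (assumption for illustration).
markup = (
    "<html><head><title>Example page</title></head>"
    "<body><p>first</p><p>second</p></body></html>"
)

# fromstring returns the root element, here <html>.
root = ET.fromstring(markup)

# Paths are relative to the root element.
title = root.findtext("head/title")

# iter() walks the whole tree, so nested <p> tags are found too.
paragraphs = [p.text for p in root.iter("p")]

print(title)
print(paragraphs)
```

Since lxml uses a mostly compatible API, code along these lines should need few changes if you later switch to lxml.etree.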
Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.
Although Beautiful Soup is oft recommended (every question regarding web scraping with Python in Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it as the installation code was still Python 2.
As for adequate and simple-to-install solutions for Python 3, you can try the standard library's HTML parser (html.parser). Although quite barebones, it comes with Python 3.
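Here is a rough sketch of how extracting the text between specific tags could look with html.parser. The class name TagTextCollector and the sample HTML are my own inventions for illustration; only HTMLParser and its handle_* methods come from the standard library.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0     # greater than 0 while inside the target tag
        self.values = []   # one string per outermost occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                # Start a fresh entry for a new outermost occurrence.
                self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested child tags is included as well.
        if self.depth > 0:
            self.values[-1] += data


collector = TagTextCollector("b")
collector.feed("<p>Some <b>bold</b> and <b>more <i>nested</i> text</b></p>")
collector.close()
print(collector.values)
```

The parser is event based, so you have to track state yourself (here with a depth counter), which is what makes it feel barebones compared to lxml or Beautiful Soup.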
I know this is way late, but for future reference, Beautiful Soup 4.3.2 is available as of Oct. 2013.
http://www.crummy.com/software/BeautifulSoup/bs4/download/
It is compatible with Python 3.
BeautifulSoup, with its version 3.1.0.1 (January 2009), also works with Python 3.x.
I do not have direct experience with BeautifulSoup under Py3k (although this should change soon...). I just read, however, that version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than its previous versions, so I may wait if possible (i.e. stay with Python 2.6 a bit longer).
You might try beautifulsoup4, which is compatible with both Python 2 and Python 3. You can use it easily like this:
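from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup("<p>Some<b>bad<i>HTML", "html.parser")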