Best library to parse HTML with Python 3 and example?

I'm completely new to Python and am using Python 3.1 on Windows (pywin). I need to parse some HTML, essentially to extract values between specific HTML tags, and am confused a

6 Answers

自闭症患者

2020-12-24 13:32
I'm currently using lxml, and on Windows I used the installation binary from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml.
```
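import lxml.html
# Parse an HTML string (pass your own markup in place of the ellipsis).
page = lxml.html.fromstring(...)
# xpath() returns a list of matches; take the first <title> text node.
title = page.xpath('//head/title/text()')[0]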
```

南笙

2020-12-24 13:39

If your HTML is well formed, you have many options, such as SAX and DOM. If it is not well formed, you need a fault-tolerant parser such as Beautiful Soup, element tidy, or lxml's HTML parser. No parser is perfect; when presented with a variety of broken HTML, I sometimes have to try more than one. lxml and ElementTree use a mostly compatible API that is more of a standard than Beautiful Soup's.

In my opinion, lxml is the best module for working with XML documents, but the ElementTree included with Python is still pretty good. In the past I have used Beautiful Soup to convert HTML to XML and construct an ElementTree for processing the data.
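
For the well-formed case, here is a minimal sketch using the ElementTree module that ships with Python. The sample markup, the `price` class, and the `extract_prices` helper are made up for illustration; the point is only that a strict XML parser works on clean markup and raises an error on broken HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed markup; real-world HTML is often not this clean.
WELL_FORMED = """<html>
  <head><title>Example page</title></head>
  <body>
    <p class="price">19.99</p>
    <p class="price">5.00</p>
  </body>
</html>"""


def extract_prices(markup):
    """Return the text of every <p class="price"> element (hypothetical helper)."""
    root = ET.fromstring(markup)  # raises ET.ParseError on malformed input
    return [p.text for p in root.iter("p") if p.get("class") == "price"]


# The root element is <html>, so the path is relative to it.
title = ET.fromstring(WELL_FORMED).findtext("head/title")
prices = extract_prices(WELL_FORMED)
print(title, prices)
```

If the input has unclosed tags, `ET.fromstring` raises `ET.ParseError`, which is the signal to switch to one of the fault-tolerant parsers mentioned above.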

鱼传尺愫

2020-12-24 13:40

Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.

Although Beautiful Soup is often recommended (every question about web scraping with Python on Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it, as the installation code was still Python 2.

As for adequate and simple-to-install solutions for Python 3, you can try the standard library's HTML parser (`html.parser`); although quite barebones, it comes with Python 3.
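
As a rough sketch of what that looks like, the standard-library parser can pull out the text between specific tags by subclassing `HTMLParser`. The class name `TagTextExtractor`, the helper `extract_tag_text`, and the sample markup are made up for illustration, and this handles only the simple case of collecting text inside one tag name.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while inside the target tag
        self.values = []  # one string per outermost matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is included as well.
        if self.depth > 0:
            self.values[-1] += data


def extract_tag_text(markup, tag):
    """Hypothetical helper: return stripped text of each <tag> element."""
    parser = TagTextExtractor(tag)
    parser.feed(markup)
    parser.close()
    return [value.strip() for value in parser.values]


sample = "<html><body><td>first</td><td> second <b>bold</b></td></body></html>"
print(extract_tag_text(sample, "td"))  # ['first', 'second bold']
```

The parser does not build a tree, so anything beyond simple extraction (attribute filters, nested structure) means tracking more state by hand.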

醉话见心

2020-12-24 13:40

I know this is way late, but for future reference, Beautiful Soup 4.3.2 is available as of Oct. 2013.

http://www.crummy.com/software/BeautifulSoup/bs4/download/

It is compatible with Python 3.

北荒

2020-12-24 13:44

Beautiful Soup, as of version 3.1.0.1 (January 2009), also works with Python 3.x.

I do not have direct experience with Beautiful Soup under Py3k (although this should soon change...). I just read, however, that version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than its previous versions, so I may wait if possible (i.e. stay with Python 2.6 a bit longer).

庸人自扰

2020-12-24 13:51
You might try beautifulsoup4, which is compatible with both Python 2 and Python 3. You can use it like this:
```
from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup("<p>Some<b>bad<i>HTML", "html.parser")
```