lxml

BeautifulSoup4: a parsing library for Python 3

非 Y 不嫁゛ submitted on 2020-01-01 21:09:50
Beautiful Soup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages; it offers a rich API and supports multiple parsers. Beautiful Soup has three notable features: it provides simple methods and Pythonic idioms for navigating, searching, and modifying the parse tree, acting as a toolkit that parses a document and hands you the data you need to scrape; it automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you rarely need to think about encodings unless the document omits its encoding declaration, in which case you only need to specify the original encoding; and it sits on top of popular Python parsers such as lxml and html5lib, letting you try different parsing strategies or trade speed for flexibility. 1. Installing and configuring Beautiful Soup 4 Beautiful Soup 4 is published on PyPI, so it can be installed with a system package tool under the name beautifulsoup4: $ easy_install beautifulsoup4 or $ pip install beautifulsoup4 It can also be installed from a source tarball: # wget https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz # tar xf
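Once installed, basic usage can be sketched as follows. This is a minimal example (the markup and names here are illustrative, not from the original post); the `"html.parser"` backend is the stdlib one, and `"lxml"` or `"html5lib"` can be swapped in.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello</p><a href='/a'>link</a></body></html>"
# "html.parser" is the stdlib backend; "lxml" or "html5lib" can be used instead
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)    # text of the first <p>
print(soup.a["href"])   # attribute access works like a dict
print(soup.p["class"])  # multi-valued attributes come back as a list
```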

How to get the CSS attributes of an lxml element?

拥有回忆 submitted on 2020-01-01 19:24:10
Question: I want a fast function that returns all style properties of an lxml element, taking into account the CSS stylesheet, the element's style attribute, and inheritance. For example: html : <body> <p>A</p> <p id='b'>B</p> <p style='color:blue'>B</p> </body> css : body {color:red;font-size:12px} p.b {color:pink;} python : elements = document.xpath('//p') print get_style(element[0]) >{color:red,font-size:12px} print get_style(element[1]) >{color:pink,font-size:12px} print get_style
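A full cascade resolver (selector matching, specificity, stylesheet rules) needs a CSS library such as cssutils or tinycss2; none ships with lxml. As a much smaller sketch of the shape of `get_style`, the following handles only inline `style` attributes plus inheritance from ancestors; all names and markup here are illustrative.

```python
import lxml.html

def parse_inline(el):
    # Turn 'color:red; font-size:12px' into a {property: value} dict
    style = el.get("style") or ""
    return dict(
        tuple(part.strip() for part in decl.split(":", 1))
        for decl in style.split(";") if ":" in decl
    )

def get_style(el):
    # Walk from the root down to el so nearer ancestors override farther ones
    props = {}
    for node in reversed([el] + list(el.iterancestors())):
        props.update(parse_inline(node))
    return props

doc = lxml.html.fromstring(
    '<html><body style="color:red">'
    '<p style="font-size:12px">A</p></body></html>')
p = doc.xpath("//p")[0]
print(get_style(p))  # inherited color plus the element's own font-size
```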

lxml cssselect Parsing

半腔热情 submitted on 2020-01-01 18:37:44
Question: I have a document with the following data: <div class="ds-list"> <b>1. </b> A domesticated carnivorous mammal <i>(Canis familiaris)</i> related to the foxes and wolves and raised in a wide variety of breeds. </div> And I want to get everything within the class ds-list (without the <b> and <i> tags). Currently my code is doc.cssselect('div.ds-list'), but all this picks up is the newline before the <b>. How can I get this to do what I want? Answer 1: Perhaps you are looking for the text_content

Parsing HTML: lxml error in Python

拜拜、爱过 submitted on 2020-01-01 16:39:33
Question: I am writing a simple script to fetch the big grey table from here. The code I have is the following: import urllib2 from lxml import etree html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read() root = etree.XML(html) But I am getting an error on the last statement. Traceback (most recent call last): File "D:\Workspace\afi100\afi100.py", line 13, in <module> root = etree.XML(html) File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577) File
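The likely cause is that `etree.XML()` requires well-formed XML, while real-world web pages rarely are; lxml's recovering HTML parser (`etree.HTML()` or the `lxml.html` module) handles them. A minimal sketch with illustrative markup:

```python
from lxml import etree

page = "<html><body><table><tr><td>Movie</td></table>"  # tags left unclosed
# etree.XML(page) would raise XMLSyntaxError on this input;
# etree.HTML() uses libxml2's recovering HTML parser and builds a tree anyway
root = etree.HTML(page)
cells = root.xpath("//td/text()")
print(cells)
```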

How to get an attribute of an Element that is namespaced

丶灬走出姿态 submitted on 2020-01-01 09:42:17
Question: I'm parsing an XML document that I receive from a vendor every day, and it uses namespaces heavily. I've minimized the problem to a minimal subset here: there are some elements I need to parse, all of which are children of an element with a specific attribute. I am able to use lxml.etree.Element.findall(TAG, root.nsmap) to find the candidate nodes whose attribute I need to check. I'm then trying to check the attribute of each of these Elements via the name I know it uses: which
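The question is truncated, but the usual stumbling block is that namespaced attributes must be looked up with Clark notation, `{namespace-uri}localname`, not with the `prefix:name` form seen in the source document. A sketch with an illustrative namespace and attribute:

```python
from lxml import etree

xml = b"""<root xmlns:v="http://vendor.example/ns">
  <item v:status="active"/>
</root>"""

root = etree.fromstring(xml)
item = root.find("item")

# The prefix "v" exists only in the serialized document; lxml stores the
# attribute under its full namespace URI in Clark notation
print(item.get("{http://vendor.example/ns}status"))
print(item.get("status"))  # plain name does not match the namespaced attribute
```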

Python lxml/beautiful soup to find all links on a web page

独自空忆成欢 submitted on 2020-01-01 09:34:29
Question: I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html... result = self._openurl(self.mainurl) content = result.read() html = lxml.html.fromstring(content) print lxml.html.find_rel_links(html,'href') Answer 1: Use XPath. Something like (can't test from here): urls = html.xpath('//a/@href') Answer 2: With iterlinks, lxml provides an excellent function for this
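Both suggested approaches can be sketched side by side on illustrative markup; `find_rel_links` from the question only matches `<a rel="…">` links, which is why it came up empty for plain hrefs.

```python
import lxml.html

page = '<html><body><a href="/a">A</a><a href="http://x.test/b">B</a></body></html>'
doc = lxml.html.fromstring(page)

# Answer 1: one XPath expression pulls every href attribute value
hrefs = doc.xpath("//a/@href")

# Answer 2: iterlinks yields (element, attribute, link, pos) for every
# link-bearing attribute in the document (href, src, etc.)
links = [link for element, attribute, link, pos in doc.iterlinks()]

print(hrefs)
print(links)
```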

Error 'failed to load external entity' when using Python lxml

本小妞迷上赌 submitted on 2020-01-01 07:31:06
Question: I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error: ': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?> That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far: import urllib2 import lxml.etree as etree file =
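The code is cut off, but this error message commonly appears when the document *text* is handed to `etree.parse()`, which expects a filename, URL, or file object, so libxml2 tries to open the text as a location. Parsing a string already in memory is done with `etree.fromstring()` instead (as bytes, since the document carries an encoding declaration). A sketch with illustrative XML:

```python
from lxml import etree

xml_text = b'<?xml version="1.0" encoding="UTF-8"?><feed><entry>1</entry></feed>'

# etree.parse(xml_text) would fail with "failed to load external entity",
# because parse() treats its argument as a filename or URL, not content.
root = etree.fromstring(xml_text)
print(root.findtext("entry"))
```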

Error parsing a DTD using lxml

巧了我就是萌 submitted on 2020-01-01 06:54:49
Question: I'm trying to write a validation script that will validate XML against the NITF DTD, http://www.iptc.org/std/NITF/3.4/specification/dtd/nitf-3-4.dtd. Based on this post I came up with the following simple script to validate a NITF XML document. Below is the error message I get when the script is run, which isn't very descriptive and makes it hard to debug. Any help is appreciated. #!/usr/bin/env python def main(): from lxml import etree, objectify from StringIO import StringIO f = open('nitf
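The script is truncated, but the general shape of DTD validation with lxml can be sketched as follows. The tiny inline DTD here is a stand-in for the real NITF DTD file, and `io.StringIO` replaces the Python 2 `StringIO` module used in the question; `dtd.error_log` holds the detailed reasons when validation fails.

```python
from io import StringIO
from lxml import etree

# Stand-in DTD; a real script would pass the opened nitf-3-4.dtd file instead
dtd = etree.DTD(StringIO("<!ELEMENT nitf (body)> <!ELEMENT body (#PCDATA)>"))

good = etree.fromstring("<nitf><body>text</body></nitf>")
bad = etree.fromstring("<nitf><head/></nitf>")  # <head> is not declared

print(dtd.validate(good))
print(dtd.validate(bad))
# error_log explains *why* the last validation failed
print(dtd.error_log.filter_from_errors())
```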