lxml

BeautifulSoup4: a parsing library for Python 3

非 Y 不嫁゛ submitted on 2020-01-01 21:09:50
Beautiful Soup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages; it offers a rich API and supports multiple parsers. Beautiful Soup has three notable features: it provides simple methods and Pythonic idioms for navigating, searching, and modifying the parse tree, acting as a toolkit that parses a document and hands you the data you need to scrape; it automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you rarely need to think about encodings unless the document omits its encoding declaration, in which case you only need to specify the original encoding; and it sits on top of popular Python parsers such as lxml and html5lib, letting you try different parsing strategies or trade speed for flexibility. 1. Installing and configuring Beautiful Soup 4 Beautiful Soup 4 is published on PyPI, so it can be installed with a system package tool under the name beautifulsoup4: $ easy_install beautifulsoup4 or $ pip install beautifulsoup4 It can also be installed from a source tarball: # wget https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz # tar xf
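Once installed, basic usage can be sketched as follows. This is a minimal example (the markup and names here are illustrative, not from the original post); the `"html.parser"` backend is the stdlib one, and `"lxml"` or `"html5lib"` can be swapped in.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello</p><a href='/a'>link</a></body></html>"
# "html.parser" is the stdlib backend; "lxml" or "html5lib" can be used instead
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)    # text of the first <p>
print(soup.a["href"])   # attribute access works like a dict
print(soup.p["class"])  # multi-valued attributes come back as a list
```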

How to get the CSS attributes of an lxml element?

拥有回忆 submitted on 2020-01-01 19:24:10
Question: I want a fast function that returns all style properties of an lxml element, taking into account the CSS stylesheet, the element's style attribute, and inheritance. For example: html : <body> <p>A</p> <p id='b'>B</p> <p style='color:blue'>B</p> </body> css : body {color:red;font-size:12px} p.b {color:pink;} python : elements = document.xpath('//p') print get_style(element[0]) >{color:red,font-size:12px} print get_style(element[1]) >{color:pink,font-size:12px} print get_style
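A full cascade resolver (selector matching, specificity, stylesheet rules) needs a CSS library such as cssutils or tinycss2; none ships with lxml. As a much smaller sketch of the shape of `get_style`, the following handles only inline `style` attributes plus inheritance from ancestors; all names and markup here are illustrative.

```python
import lxml.html

def parse_inline(el):
    # Turn 'color:red; font-size:12px' into a {property: value} dict
    style = el.get("style") or ""
    return dict(
        tuple(part.strip() for part in decl.split(":", 1))
        for decl in style.split(";") if ":" in decl
    )

def get_style(el):
    # Walk from the root down to el so nearer ancestors override farther ones
    props = {}
    for node in reversed([el] + list(el.iterancestors())):
        props.update(parse_inline(node))
    return props

doc = lxml.html.fromstring(
    '<html><body style="color:red">'
    '<p style="font-size:12px">A</p></body></html>')
p = doc.xpath("//p")[0]
print(get_style(p))  # inherited color plus the element's own font-size
```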

lxml cssselect Parsing

半腔热情 submitted on 2020-01-01 18:37:44
Question: I have a document with the following data: <div class="ds-list"> <b>1. </b> A domesticated carnivorous mammal <i>(Canis familiaris)</i> related to the foxes and wolves and raised in a wide variety of breeds. </div> And I want to get everything within the class ds-list (without the <b> and <i> tags). Currently my code is doc.cssselect('div.ds-list'), but all this picks up is the newline before the <b>. How can I get this to do what I want? Answer 1: Perhaps you are looking for the text_content

Parsing HTML: lxml error in Python

拜拜、爱过 submitted on 2020-01-01 16:39:33
Question: I am writing a simple script to fetch the big grey table from here. The code I have is the following: import urllib2 from lxml import etree html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read() root = etree.XML(html) But I am getting an error on the last statement. Traceback (most recent call last): File "D:\Workspace\afi100\afi100.py", line 13, in <module> root = etree.XML(html) File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577) File
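The likely cause is that `etree.XML()` requires well-formed XML, while real-world web pages rarely are; lxml's recovering HTML parser (`etree.HTML()` or the `lxml.html` module) handles them. A minimal sketch with illustrative markup:

```python
from lxml import etree

page = "<html><body><table><tr><td>Movie</td></table>"  # tags left unclosed
# etree.XML(page) would raise XMLSyntaxError on this input;
# etree.HTML() uses libxml2's recovering HTML parser and builds a tree anyway
root = etree.HTML(page)
cells = root.xpath("//td/text()")
print(cells)
```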

How to get an attribute of an Element that is namespaced

丶灬走出姿态 submitted on 2020-01-01 09:42:17
Question: I'm parsing an XML document that I receive from a vendor every day, and it uses namespaces heavily. I've minimized the problem to a minimal subset here: there are some elements I need to parse, all of which are children of an element with a specific attribute. I am able to use lxml.etree.Element.findall(TAG, root.nsmap) to find the candidate nodes whose attribute I need to check. I'm then trying to check the attribute of each of these Elements via the name I know it uses: which
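The question is truncated, but the usual stumbling block is that namespaced attributes must be looked up with Clark notation, `{namespace-uri}localname`, not with the `prefix:name` form seen in the source document. A sketch with an illustrative namespace and attribute:

```python
from lxml import etree

xml = b"""<root xmlns:v="http://vendor.example/ns">
  <item v:status="active"/>
</root>"""

root = etree.fromstring(xml)
item = root.find("item")

# The prefix "v" exists only in the serialized document; lxml stores the
# attribute under its full namespace URI in Clark notation
print(item.get("{http://vendor.example/ns}status"))
print(item.get("status"))  # plain name does not match the namespaced attribute
```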

Python lxml/beautiful soup to find all links on a web page

独自空忆成欢 submitted on 2020-01-01 09:34:29
Question: I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>'s from the html... result = self._openurl(self.mainurl) content = result.read() html = lxml.html.fromstring(content) print lxml.html.find_rel_links(html,'href') Answer 1: Use XPath. Something like (can't test from here): urls = html.xpath('//a/@href') Answer 2: With iterlinks, lxml provides an excellent function for this
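Both suggested approaches can be sketched side by side on illustrative markup; `find_rel_links` from the question only matches `<a rel="…">` links, which is why it came up empty for plain hrefs.

```python
import lxml.html

page = '<html><body><a href="/a">A</a><a href="http://x.test/b">B</a></body></html>'
doc = lxml.html.fromstring(page)

# Answer 1: one XPath expression pulls every href attribute value
hrefs = doc.xpath("//a/@href")

# Answer 2: iterlinks yields (element, attribute, link, pos) for every
# link-bearing attribute in the document (href, src, etc.)
links = [link for element, attribute, link, pos in doc.iterlinks()]

print(hrefs)
print(links)
```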

Error 'failed to load external entity' when using Python lxml

本小妞迷上赌 submitted on 2020-01-01 07:31:06
Question: I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error: ': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?> That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far: import urllib2 import lxml.etree as etree file =
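The code is cut off, but this error message commonly appears when the document *text* is handed to `etree.parse()`, which expects a filename, URL, or file object, so libxml2 tries to open the text as a location. Parsing a string already in memory is done with `etree.fromstring()` instead (as bytes, since the document carries an encoding declaration). A sketch with illustrative XML:

```python
from lxml import etree

xml_text = b'<?xml version="1.0" encoding="UTF-8"?><feed><entry>1</entry></feed>'

# etree.parse(xml_text) would fail with "failed to load external entity",
# because parse() treats its argument as a filename or URL, not content.
root = etree.fromstring(xml_text)
print(root.findtext("entry"))
```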

Error parsing a DTD using lxml

巧了我就是萌 submitted on 2020-01-01 06:54:49
Question: I'm trying to write a validation script that will validate XML against the NITF DTD, http://www.iptc.org/std/NITF/3.4/specification/dtd/nitf-3-4.dtd. Based on this post I came up with the following simple script to validate a NITF XML document. Below is the error message I get when the script is run, which isn't very descriptive and makes it hard to debug. Any help is appreciated. #!/usr/bin/env python def main(): from lxml import etree, objectify from StringIO import StringIO f = open('nitf
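The script is truncated, but the general shape of DTD validation with lxml can be sketched as follows. The tiny inline DTD here is a stand-in for the real NITF DTD file, and `io.StringIO` replaces the Python 2 `StringIO` module used in the question; `dtd.error_log` holds the detailed reasons when validation fails.

```python
from io import StringIO
from lxml import etree

# Stand-in DTD; a real script would pass the opened nitf-3-4.dtd file instead
dtd = etree.DTD(StringIO("<!ELEMENT nitf (body)> <!ELEMENT body (#PCDATA)>"))

good = etree.fromstring("<nitf><body>text</body></nitf>")
bad = etree.fromstring("<nitf><head/></nitf>")  # <head> is not declared

print(dtd.validate(good))
print(dtd.validate(bad))
# error_log explains *why* the last validation failed
print(dtd.error_log.filter_from_errors())
```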