lxml

Crawler parsing library: XPath

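痴心易碎 submitted on 2019-12-12 21:47:33

XPath

XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents, so a crawler can use XPath to extract the information it needs.

1. XPath overview

XPath offers powerful selection capabilities through concise path expressions. It also provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node you want to locate can be selected with XPath.

Official documentation: https://www.w3.org/TR/xpath/

2. Common XPath rules

    Expression   Description
    nodename     Selects the child nodes named nodename
    /            Selects direct children of the current node
    //           Selects descendants of the current node
    .            Selects the current node
    ..           Selects the parent of the current node
    @            Selects attributes

These are the commonly used XPath matching rules. For example:

    //title[@lang='eng']

This XPath rule selects all nodes named title whose lang attribute has the value eng. Later, Python's lxml library is used to parse HTML with XPath.

3. Installation

On Windows with Python 3: pip install lxml

4.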
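The rules listed above can be tried out with lxml directly. The following is a minimal sketch, assuming a small made-up bookstore document (the element names and values are illustrative, not from the original post); the same `xpath()` calls should also work on a tree produced by `etree.HTML()`.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.fromstring(
    """<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>"""
)

# //title[@lang='eng']: every title element whose lang attribute equals "eng".
titles = doc.xpath("//title[@lang='eng']")
texts = [t.text for t in titles]

# @ selects attributes: collect the lang value of every title.
langs = doc.xpath("//title/@lang")

# .. selects the parent of the current node.
parent_tag = titles[0].xpath("..")[0].tag

print(texts, langs, parent_tag)
```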

parsing XML configuration file using Etree in python

偶尔善良 submitted on 2019-12-12 19:36:30

Question

Please help me parse a configuration file of the prototype below using lxml etree. I tried iterating with for event, element and calling tostring. Unfortunately, I don't need the text but the XML between <template name> <config> </template> for a given attribute. I started with this code, but I get a KeyError while searching for the attribute, since it scans from the start:

    config_tree = etree.iterparse(token_template_file)
    for event, element in config_tree:
        if element.attrib['name']=="ad auth":
            print ("attrib

How to import lxml xpath functions to default namespace?

时光总嘲笑我的痴心妄想 submitted on 2019-12-12 15:55:51

Question

This is an example from the lxml documentation:

    >>> regexpNS = "http://exslt.org/regular-expressions"
    >>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
    ...                    namespaces={'re':regexpNS})
    >>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
    >>> print(find(root)[0].text)
    aBc

I want to import the re:test() function into the default namespace so that I can call it without the re: prefix. How can I do it? Thanks!

Answer 1:

You can put a function in the empty function namespace:

    functionNS = etree.FunctionNamespace(None)
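A possible way to build on the empty function namespace is to register a Python function under the name test. The sketch below is an assumption about how that could look, not the continuation of the truncated answer: `regex_test` is a hypothetical re-implementation based on Python's `re` module (it only honours the 'i' flag), not the EXSLT function itself. Note that the registration is global for the process.

```python
import re
from lxml import etree


def regex_test(context, nodes, pattern, flags=""):
    """Hypothetical stand-in for re:test(), backed by Python's re module."""
    if isinstance(nodes, list):
        # A node-set arrives as a list; use the string value of its first node.
        if not nodes:
            return False
        first = nodes[0]
        text = first if isinstance(first, str) else "".join(first.itertext())
    else:
        text = str(nodes)
    re_flags = re.IGNORECASE if "i" in flags else 0
    return re.search(pattern, text, re_flags) is not None


# Register the function in the empty (default) function namespace.
function_ns = etree.FunctionNamespace(None)
function_ns["test"] = regex_test

root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
matches = root.xpath("//*[test(., '^abc$', 'i')]")
print([m.tag for m in matches])
```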

LXML failed to install on Plone 4.3 64-bit (MS Windows)

[亡魂溺海] submitted on 2019-12-12 14:26:53

Question

I've been asked to update my answer in the correct format: question, then answer.

    running build_py
    creating build
    creating build\lib.win-amd64-2.7
    creating build\lib.win-amd64-2.7\lxml
    copying src\lxml\builder.py -> build\lib.win-amd64-2.7\lxml

The build then generates:

    running build_ext
    building 'lxml.etree' extension
    error: Setup script exited with error: Unable to find vcvarsall.bat
    An error occurred when trying to install lxml 2.3.6. Look above this message for any errors that were output by easy_install

“undefined symbol: __xmlStructuredErrorContext” importing etree from lxml

允我心安 submitted on 2019-12-12 12:28:46

Question

    >>> import lxml
    >>> from lxml import etree
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: /usr/local/lib/python3.4/site-packages/lxml/etree.cpython-34m.so: undefined symbol: __xmlStructuredErrorContext

I do have libxml2 and libxslt, and I have tried uninstalling and reinstalling too, but it didn't help.

lxml version: 3.4.4, Python: 3.4.2, OS: RHEL 5.5

Please help resolve this issue. Thanks

Answer 1:

Your version of lxml.etree was compiled against a different version of
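On an installation where the import succeeds, lxml exposes version tuples that can help spot this kind of mismatch. The sketch below only reads those attributes; it is a diagnostic aid under the assumption that a build-time and run-time library difference is the cause, and it cannot run in the broken environment from the question.

```python
from lxml import etree

# Version of libxml2 that lxml was compiled against.
compiled = etree.LIBXML_COMPILED_VERSION
# Version of libxml2 that is actually loaded at run time.
runtime = etree.LIBXML_VERSION

print("lxml:", etree.LXML_VERSION)
print("libxml2 compiled:", compiled)
print("libxml2 runtime: ", runtime)

# A difference here suggests the extension module and the shared library
# found at run time do not come from the same build.
versions_differ = compiled != runtime
print("versions differ:", versions_differ)
```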

Does the E-factory of lxml support dynamically generated data?

混江龙づ霸主 submitted on 2019-12-12 11:22:55

Question

Is there a way to create tags dynamically with the E-factory of lxml? For instance, I get a syntax error for the following code:

    E.BODY(
        E.TABLE(
            for row_num in range(len(ws.rows)):
                row = ws.rows[row_num]
                # create a tr tag
                E.TR(
                    for cell_num in range(len(row)):
                        cell = row[cell_num]

I get the following error:

    for row_num in range(len(ws.rows)):
    ^
    SyntaxError: invalid syntax

Answer 1:

In order to create multiple child nodes, pass multiple positional or keyword arguments. Working example:

    from
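Since a for statement cannot appear inside a call, one way to pass a variable number of children is argument unpacking with generator expressions. This is a minimal sketch, not the truncated answer's own example; the `rows` list is a hypothetical stand-in for `ws.rows`.

```python
from lxml import etree
from lxml.builder import E

# Hypothetical stand-in for ws.rows: a list of rows, each a list of cell values.
rows = [["a1", "b1"], ["a2", "b2"]]

# Each generator expression is unpacked with * so that every TR (and every TD)
# is passed to the factory as a separate positional argument.
table = E.TABLE(
    *(E.TR(*(E.TD(str(cell)) for cell in row)) for row in rows)
)
body = E.BODY(table)

print(etree.tostring(body, pretty_print=True).decode())
```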

Schematron validation with lxml in Python: how to retrieve validation errors?

拥有回忆 submitted on 2019-12-12 10:56:52

Question

I'm trying to do some Schematron validation with lxml. For the specific application I'm working on, it's important that any tests that fail validation are reported back. The lxml documentation mentions the validation_report property. I think this should contain the information I'm looking for, but I can't figure out how to work with it. Here's some example code that demonstrates my problem (adapted from http://lxml.de/validation.html#id2; tested with Python 2.7.4):
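As a rough sketch of how the report might be read: `lxml.isoschematron.Schematron` accepts `store_report=True`, after which `validation_report` holds an SVRL document whose failed-assert elements describe the failing tests. The schema and document below are small made-up examples, not the asker's code, and the exact report contents may vary between lxml versions.

```python
from io import StringIO
from lxml import etree, isoschematron

# Hypothetical schema: the Percent values under Total must add up to 100.
schema_doc = etree.parse(StringIO("""\
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="sum_equals_100_percent">
    <rule context="Total">
      <assert test="sum(//Percent)=100">Sum is not 100%.</assert>
    </rule>
  </pattern>
</schema>
"""))

# Hypothetical document that violates the rule (the sum is 101).
doc = etree.parse(StringIO("""\
<Total>
  <Percent>21</Percent>
  <Percent>30</Percent>
  <Percent>50</Percent>
</Total>
"""))

# store_report=True keeps the SVRL report of the last validation run.
schematron = isoschematron.Schematron(schema_doc, store_report=True)
is_valid = schematron.validate(doc)

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
report = schematron.validation_report
failed_messages = report.xpath(
    "//svrl:failed-assert/svrl:text/text()", namespaces={"svrl": SVRL_NS}
)

print(is_valid)
print(failed_messages)
```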

Tags with : in name in lxml

这一生的挚爱 submitted on 2019-12-12 10:56:32

Question

I'm trying to use lxml.etree to parse a WordPress export document (it's XML, somewhat RSS-like). I'm only interested in published posts, so I'm using the following to loop through them:

    for item in data.findall("item"):
        if item.find("wp:post_type").text != "post":
            continue
        if item.find("wp:status").text != "publish":
            continue
        write_post(item)

where data is the tag that contains all the item tags. The item tags contain posts, pages, and drafts. My problem is that lxml can't find tags
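A prefix such as wp: only resolves when the search is given a prefix-to-URI mapping. The sketch below assumes a tiny made-up export document and takes the namespace URI from the root element's `nsmap` rather than hard-coding it; the URI shown in the sample is an assumption and may differ in a real export file.

```python
from lxml import etree

# Hypothetical, heavily reduced export document; the namespace URI is an assumption.
xml = b"""<rss xmlns:wp="http://wordpress.org/export/1.2/"><channel>
<item><title>A</title><wp:post_type>post</wp:post_type><wp:status>publish</wp:status></item>
<item><title>B</title><wp:post_type>page</wp:post_type><wp:status>publish</wp:status></item>
<item><title>C</title><wp:post_type>post</wp:post_type><wp:status>draft</wp:status></item>
</channel></rss>"""

root = etree.fromstring(xml)
data = root.find("channel")

# Map the "wp" prefix to whatever URI the document itself declares.
ns = {"wp": root.nsmap["wp"]}

published = []
for item in data.findall("item"):
    # Passing namespaces= lets find/findtext resolve the wp: prefix.
    if item.findtext("wp:post_type", namespaces=ns) != "post":
        continue
    if item.findtext("wp:status", namespaces=ns) != "publish":
        continue
    published.append(item.findtext("title"))

print(published)
```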

How to select parent based on the child in lxml?

那年仲夏 submitted on 2019-12-12 10:54:47

Question

I have this code:

    <table cellspacing="1" cellpadding="1" border="0">
      <tbody>
        <tr>
          <td>Something else</td>
        </tr>
        <tr>
          <td valign="top">
            <a href="http://exact url">Something</a>
          </td>
          <td valign="top">Something else</td>
        </tr>
      </tbody>
    </table>

I want to find the table, but it is very hard to target because the very same code is used about 10 times. I do know what is in the URL, though. How can I get the parent table from it?

Answer 1:

If t is the etree for this snippet of XML, then the link you're looking for is t
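One way to go from a known link to its enclosing table is the XPath ancestor axis. The sketch below is an illustration under assumptions rather than the truncated answer's approach: the URLs and the id attributes are hypothetical and exist only so the two tables can be told apart.

```python
from lxml import etree

# Hypothetical page fragment with two similar tables; ids are for illustration only.
snippet = """<div>
<table id="first"><tbody><tr>
  <td><a href="http://example.com/other">Other</a></td>
</tr></tbody></table>
<table id="second"><tbody>
  <tr><td>Something else</td></tr>
  <tr>
    <td valign="top"><a href="http://example.com/exact">Something</a></td>
    <td valign="top">Something else</td>
  </tr>
</tbody></table>
</div>"""

root = etree.fromstring(snippet)
target_url = "http://example.com/exact"

# Find the link by its href, then step up to the nearest enclosing table.
# $url is an XPath variable filled in from the keyword argument.
tables = root.xpath("//a[@href=$url]/ancestor::table[1]", url=target_url)

print([t.get("id") for t in tables])
```

If only part of the URL is known, a predicate such as `contains(@href, $url)` could replace the exact comparison.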

Wildcard namespaces in lxml

╄→尐↘猪︶ㄣ submitted on 2019-12-12 10:54:38

Question

How can I query with XPath while ignoring the XML namespace? I am using the Python lxml library. I tried the solution from this question, but it doesn't seem to work.

    In [151]: e.find("./*[local-name()='Buckets']")
      File "<string>", line unknown
    SyntaxError: invalid predicate

Answer 1:

Use e.xpath, not e.find:

    import lxml.etree as ET

    content = '''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Buckets>
        some stuff
      </Buckets>
    </Envelope>
    '''

    root = ET.fromstring(content)
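The difference between the two methods can be sketched as follows: `find()` only understands the limited ElementPath syntax, while `xpath()` evaluates full XPath 1.0, including `local-name()`. This is a self-contained illustration that reuses the sample document from the answer; the variable names are my own.

```python
import lxml.etree as ET

content = """\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Buckets>
    some stuff
  </Buckets>
</Envelope>
"""

root = ET.fromstring(content)

# local-name() compares the tag name without its namespace, so the
# default namespace on Envelope does not need to be spelled out.
buckets = root.xpath("./*[local-name()='Buckets']")

print(buckets[0].tag)
print(buckets[0].text.strip())
```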