lxml

XPath predicate with sub-paths with lxml?

Submitted by 我与影子孤独终老i on 2019-12-06 01:27:13
Question: I'm trying to understand an XPath that was sent to me for use with ACORD XML forms (a common format in insurance). The XPath they sent me is (truncated for brevity): ./PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo Where I'm running into trouble is that Python's lxml library is telling me that [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"] is an invalid predicate. I'm not able to find anywhere in the XPath spec on …
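A likely explanation: the predicate is valid XPath 1.0, but lxml's find()/findall() methods speak only the limited ElementPath language, which rejects sub-paths inside predicates; the xpath() method runs the full XPath engine. A minimal sketch (the ACORD document below is invented so the expression has something to match):

```python
from lxml import etree

xml = b"""<ACORD>
  <PersApplicationInfo>
    <InsuredOrPrincipal>
      <InsuredOrPrincipalInfo>
        <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
      </InsuredOrPrincipalInfo>
      <GeneralPartyInfo>party details</GeneralPartyInfo>
    </InsuredOrPrincipal>
  </PersApplicationInfo>
</ACORD>"""

root = etree.fromstring(xml)

# find()/findall() implement only the ElementPath subset, which rejects
# predicates containing sub-paths; xpath() accepts full XPath 1.0.
matches = root.xpath(
    './PersApplicationInfo/InsuredOrPrincipal'
    '[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]'
    '/GeneralPartyInfo')
print(matches[0].text)  # party details
```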

lxml.etree fromsting() and tostring() are not returning the same data

Submitted by 99封情书 on 2019-12-06 01:15:40
I'm learning lxml (after using ElementTree) and I'm baffled why .fromstring and .tostring do not appear to be reversible. Here's my example: import lxml.etree as ET f = open('somefile.xml','r') data = f.read() tree_in = ET.fromstring(data) tree_out = ET.tostring(tree_in) f2 = open('samefile.xml','w') f2.write(tree_out) f2.close 'somefile.xml' was 132 KB. 'samefile.xml' (the output) was 113 KB, and it is missing the end of the file at some arbitrary point. The closing tags of the overall tree and a few pieces of the final element are just gone. Is there something wrong with my code, …
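For what it's worth, truncation like this is usually a buffering bug rather than a tostring() problem: `f2.close` without parentheses never calls the method, so the last buffer is never flushed. A sketch of the round trip done safely (the sample file content here is made up):

```python
import lxml.etree as ET

# Create a small stand-in for 'somefile.xml' (made-up content).
with open('somefile.xml', 'wb') as f:
    f.write(b'<root><child>data</child></root>')

# tostring() returns bytes, so read and write in binary mode; the
# `with` blocks guarantee close() actually runs and flushes buffers
# (the original `f2.close` without parentheses never called it).
with open('somefile.xml', 'rb') as f:
    tree_in = ET.fromstring(f.read())

with open('samefile.xml', 'wb') as f2:
    f2.write(ET.tostring(tree_in))
```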

03 Parsing library: BeautifulSoup

Submitted by 让人想犯罪 __ on 2019-12-06 01:12:12
03 Parsing library: BeautifulSoup. 1. Introduction: Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the document, and can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer developed; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4. # Install Beautiful Soup: pip install beautifulsoup4 # Install a parser: Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with any of: $ apt-get install Python-lxml $ easy_install lxml $ pip install lxml Another available parser is html5lib, a pure-Python implementation that parses pages the same way a browser does. It can be installed with any of: $ apt-get install Python-html5lib $ easy_install html5lib $ pip install html5lib
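As a quick illustration of the library described above, a minimal Beautiful Soup session (the markup is invented; the built-in 'html.parser' is used so no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

html_doc = '<html><body><p class="title">The story</p></body></html>'

# The second argument selects the parser: the built-in 'html.parser',
# or 'lxml' / 'html5lib' once those packages are installed.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.get_text())  # The story
print(soup.p['class'])    # ['title'] -- class is a multi-valued attribute
```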

Python XPathEvaluator without namespace

Submitted by 北城余情 on 2019-12-06 00:47:26
I need to write a dynamic function that finds elements in a subtree of an Atom XML document. To do so, I've written something like this: tree = etree.parse(xmlFileUrl) e = etree.XPathEvaluator(tree, namespaces={'def':'http://www.w3.org/2005/Atom'}) entries = e('//def:entry') for entry in entries: mypath = tree.getpath(entry) + "/category" category = e(mypath) The code above fails to find category. The reason is that getpath returns an XPath without namespaces, whereas the XPathEvaluator e() requires namespaces. Is there a way to either make getpath return namespaces in the path, or allow …
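One workaround, sketched below with an invented Atom snippet: skip getpath() entirely and run a namespace-qualified query relative to each entry, so the prefixes stay under your control:

```python
from lxml import etree

ATOM = 'http://www.w3.org/2005/Atom'
xml = ('<feed xmlns="{0}">'
       '<entry><category term="python"/></entry>'
       '<entry><category term="xml"/></entry>'
       '</feed>').format(ATOM)
root = etree.fromstring(xml)

# Relative, namespace-qualified queries avoid rebuilding absolute
# paths with getpath(), whose output carries no namespace prefixes.
ns = {'def': ATOM}
terms = [cat.get('term')
         for entry in root.xpath('//def:entry', namespaces=ns)
         for cat in entry.xpath('def:category', namespaces=ns)]
print(terms)  # ['python', 'xml']
```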

I get an error when executing "from lxml import etree" in the Python command line after successfully installing lxml with pip

Submitted by て烟熏妆下的殇ゞ on 2019-12-05 21:39:09
bash-3.2$ pip install lxml-2.3.5.tgz Unpacking ./lxml-2.3.5.tgz Running setup.py egg_info for package from file:///Users/apple/workspace/pythonhome/misc/lxml-2.3.5.tgz Building lxml version 2.3.5. Building with Cython 0.17. Using build configuration of libxslt 1.1.27 Building against libxml2/libxslt in the following directory: /usr/local/lib warning: no previously-included files found matching '*.py' Installing collected packages: lxml Running setup.py install for lxml Building lxml version 2.3.5. Building with Cython 0.17. Using build configuration of libxslt 1.1.27 Building against libxml2 …

How can I prevent lxml from auto-closing empty elements when serializing to string?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-05 21:13:34
I am parsing a huge XML file which contains many empty elements such as <MemoryEnv></MemoryEnv>. When serializing with etree.tostring(root_element, pretty_print=True) the output element is collapsed to <MemoryEnv/>. Is there any way to prevent this? etree.tostring() does not provide such a facility. Is there a way to interfere with lxml's tostring() serializer? Btw, the html module does not work: it's not designed for XML, and it does not preserve empty elements in their original form. The problem is that although the collapsed and uncollapsed forms of an empty element are equivalent, the program …
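One trick that does work with plain etree.tostring(): give every childless element an empty text node, which forces the serializer to emit separate open and close tags. A sketch (note it expands every empty element, not just <MemoryEnv>):

```python
from lxml import etree

root = etree.fromstring('<Config><MemoryEnv></MemoryEnv><Other/></Config>')

# An element whose text is None serializes as <MemoryEnv/>; setting
# its text to the empty string forces <MemoryEnv></MemoryEnv> instead.
for el in root.iter():
    if len(el) == 0 and el.text is None:
        el.text = ''

out = etree.tostring(root)
print(out)  # b'<Config><MemoryEnv></MemoryEnv><Other></Other></Config>'
```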

Parsing a large XML file with Python - etree.parse error

Submitted by 為{幸葍}努か on 2019-12-05 19:47:41
Question: Trying to parse the following XML file using the lxml.etree.iterparse function. "sampleoutput.xml": <item> <title>Item 1</title> <desc>Description 1</desc> </item> <item> <title>Item 2</title> <desc>Description 2</desc> </item> I tried the code from "Parsing Large XML file with Python lxml and Iterparse"; before the etree.iterparse(MYFILE) call I did MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml","r") But it turns up the following error: Traceback (most recent call last): …
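Two things typically go wrong with this setup: the sample has two top-level <item> elements, so it is not well-formed XML (a single root is required), and iterparse() wants a filename or a file opened in binary mode, not text mode. A sketch with the sample wrapped in a root element:

```python
import io
from lxml import etree

# The two <item> elements are wrapped in one root, since XML requires
# a single root; iterparse() needs bytes, hence BytesIO here
# (a real file should be opened with mode 'rb' instead).
xml = (b'<items>'
       b'<item><title>Item 1</title><desc>Description 1</desc></item>'
       b'<item><title>Item 2</title><desc>Description 2</desc></item>'
       b'</items>')

titles = []
for event, elem in etree.iterparse(io.BytesIO(xml), tag='item'):
    titles.append(elem.findtext('title'))
    elem.clear()  # release each element once it has been processed
print(titles)  # ['Item 1', 'Item 2']
```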

Parsing lxml.etree._Element contents

Submitted by 筅森魡賤 on 2019-12-05 17:55:26
I have the following element that I parsed out of a <table>: <td align="center" valign="top"> <a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST" target="_blank"> 5548U </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/> </td> I am trying to extract "5548U Power La Vaca (M8025K) Linux 4.2.x.x" from this element (including the spaces). import lxml.etree as ET td_html = """ <td align="center" valign="top"> <a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST" target="_blank"> 5548U </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/> </td> """ td_elem = ET …
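One way to finish this, sketched under the assumption that the fragment can be re-parsed leniently: the bare & characters in the href make the snippet invalid XML, so the HTML parser is used instead, and itertext() collects the text fragments, including the tail text that follows each <br/>:

```python
from lxml import etree

td_html = """
<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST"
   target="_blank"> 5548U </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>
"""

# etree.HTML() recovers from the unescaped '&'; itertext() yields the
# <a> text plus the tail text following each <br/>, in document order.
td_elem = etree.HTML(td_html).xpath('//td')[0]
parts = [t.strip() for t in td_elem.itertext() if t.strip()]
print(' '.join(parts))  # 5548U Power La Vaca (M8025K) Linux 4.2.x.x
```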

Parsing library: BeautifulSoup

Submitted by 眉间皱痕 on 2019-12-05 17:19:57
1. Introduction: Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the document, and can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer developed; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4. # Install Beautiful Soup: pip install beautifulsoup4 # Install a parser: Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with any of: $ apt-get install Python-lxml $ easy_install lxml $ pip install lxml Another available parser is html5lib, a pure-Python implementation that parses pages the same way a browser does. It can be installed with any of: $ apt-get install Python-html5lib $ easy_install html5lib $ pip install html5lib 2. Basic usage: html_doc = """ <html> …

Web scraping - Beautiful Soup

Submitted by 这一生的挚爱 on 2019-12-05 17:19:23
Getting to know Beautiful Soup (Chinese documentation): Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the document. Install beautifulsoup4: pip install beautifulsoup4 Parsers: Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with any of: $ apt-get install Python-lxml $ easy_install lxml $ pip install lxml Another available parser is html5lib, a pure-Python implementation that parses pages the same way a browser does. It can be installed with any of: $ apt-get install Python-html5lib $ easy_install html5lib $ pip install html5lib The table below lists the main parsers along with their advantages and disadvantages; the official site recommends lxml as the parser because it is more efficient. In versions of Python before 2.7.3, and in the Python 3 line before 3.2.2, installing lxml or html5lib is mandatory, …
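To round out the notes above, a small sketch of searching a document with a chosen parser (the markup is invented; the built-in 'html.parser' keeps the example dependency-free):

```python
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<a class="sister" href="http://example.com/elsie">Elsie</a>'
        '<a class="sister" href="http://example.com/lacie">Lacie</a>'
        '</body></html>')

# find_all() filters by tag name and attributes; the keyword is
# spelled class_ to avoid clashing with the Python keyword `class`.
soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='sister')]
print(links)  # ['http://example.com/elsie', 'http://example.com/lacie']
```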