lxml

Registering namespaces with lxml before parsing

心不动则不痛 posted on 2021-02-11 05:02:09
Question: I am using lxml to parse XML from an external service that uses namespace prefixes but doesn't declare them with xmlns. I am trying to register the namespace by hand with register_namespace, but that doesn't seem to work.

    from lxml import etree
    xml = """
    <Foo xsi:type="xsd:string">bar</Foo>
    """
    etree.register_namespace('xsi', 'http://www.w3.org/2001/XMLSchema-instance')
    el = etree.fromstring(xml)
    # lxml.etree.XMLSyntaxError: Namespace prefix xsi for type on Foo is not defined

What am I missing? Oddly enough,
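
For context on the error above: etree.register_namespace only affects which prefix lxml picks when it serializes elements; it does not make an undeclared prefix known to the parser. One possible workaround, sketched here under the assumption that the fragment carries no XML declaration of its own, is to wrap it in a throwaway element that declares the prefix. The `wrapper` element and the `parse_fragment` helper are hypothetical names, not part of lxml.

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def parse_fragment(fragment, nsmap):
    """Parse a fragment whose prefixes are undeclared (hypothetical helper).

    Assumes `fragment` is a str without an XML declaration.
    """
    # Build xmlns:prefix="uri" declarations for the throwaway wrapper element.
    decls = " ".join('xmlns:%s="%s"' % (p, u) for p, u in nsmap.items())
    wrapped = "<wrapper %s>%s</wrapper>" % (decls, fragment)
    # The first child of the wrapper is the original root element.
    return etree.fromstring(wrapped)[0]

xml = """
<Foo xsi:type="xsd:string">bar</Foo>
"""
el = parse_fragment(xml, {"xsi": XSI})
print(el.get("{%s}type" % XSI))
```

The returned element is still a child of the wrapper, so serializing it will carry the inherited xmlns:xsi declaration along.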

Really weird… can't set attributes of built-in/extension type 'lxml.etree._Element'

匆匆过客 posted on 2021-02-10 14:18:40
Question: I've changed attributes for other classes before without issues. _Element is obviously not a built-in.

    from lxml.etree import _Element
    _Element.new_attr = 54

results in:

    TypeError: can't set attributes of built-in/extension type 'lxml.etree._Element'

Answer 1: _Element is implemented in Cython. As Steve Holden explains (my emphasis): The problem is that extension types' attributes are determined by the layout of the object's slots and forever fixed in the C code that implements them: the slots can
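
If the goal is simply to have extra class-level attributes available on parsed elements, one alternative (not necessarily what the truncated answer goes on to suggest) is to subclass etree.ElementBase and tell the parser to use that subclass. This is a minimal sketch; `MyElement` is a hypothetical name.

```python
from lxml import etree

class MyElement(etree.ElementBase):
    # Class attributes on a Python subclass are allowed,
    # unlike on the Cython-implemented _Element itself.
    new_attr = 54

parser = etree.XMLParser()
# Make every element produced by this parser an instance of MyElement.
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=MyElement)
)

root = etree.fromstring("<a><b/></a>", parser)
print(type(root).__name__, root.new_attr, root[0].new_attr)
```

Only elements created through this parser pick up the subclass; elements from other parsers are unaffected.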

Finding the line number of the element's ending tag in lxml

倾然丶 夕夏残阳落幕 posted on 2021-02-09 07:13:32
Question: While parsing an XML document with lxml I want to find the starting and ending line numbers of a particular tag. I am able to find the starting tag's position by using the sourceline property on lxml.etree.Element; however, I am struggling to find the closing tag's line number. A trivial example of my attempt: import lxml.etree as ET xml_sample = b'''<?xml version="1.0" encoding="utf-8"?> <collection> <item> <value>foo</value> </item> <item> <value> bar </value> </item> </collection>'''

How to use lxml and python to pretty print a subtree of an xml file?

前提是你 posted on 2021-02-08 17:55:50
Question: I have the following code using Python with lxml to pretty print the file example.xml:

    python -c '
    from lxml import etree;
    from sys import stdout, stdin;
    parser=etree.XMLParser(remove_blank_text=True, strip_cdata=False);
    tree=etree.parse(stdin, parser)
    tree.write(stdout, pretty_print = True)' < example.xml

I'm using lxml because it is important that I preserve the fidelity of the original file, including preserving the CDATA idioms. Here's the file example.xml that I'm using it on: <projects
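
To pretty print only part of the tree, one option is to locate the subtree and pass that element to etree.tostring with pretty_print=True, keeping the same parser options as in the question. The sample document below is made up, since example.xml is cut off in the excerpt; the `project` and `description` element names are assumptions.

```python
from lxml import etree

# Hypothetical stand-in for example.xml.
xml = (
    b'<projects><project name="a">'
    b'<description><![CDATA[x < y]]></description>'
    b'</project></projects>'
)

# Same parser options as in the question: drop ignorable whitespace,
# keep CDATA sections as CDATA.
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
root = etree.fromstring(xml, parser)

# Pick the subtree of interest and serialize only that element.
subtree = root.find("project")
out = etree.tostring(subtree, pretty_print=True, encoding="unicode")
print(out)
```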

Python download image with lxml

情到浓时终转凉″ posted on 2021-02-08 15:48:11
Question: I need to find an image in an HTML page similar to this one:

    ...
    <a href="/example/1">
    <img id="img" src="http://example.net/example.jpg" alt="Example" />
    </a>
    ...

I am using lxml and requests. Here is the code:

    import lxml
    from lxml import html
    import requests
    url = 'http://www.example.com'
    r = requests.get(url)
    tree = lxml.html.fromstring(r.content)
    img = tree.get_element_by_id("img")
    f = open("image.jpg",'wb')
    f.write(requests.get(img['src']).content)

But I am getting an error: Traceback
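
The traceback is cut off here, so the exact error is not shown. One likely culprit is img['src']: lxml elements are indexed by child position, and attributes are read with .get() or .attrib. A small sketch on an inline HTML string (no network access), reusing the markup from the question:

```python
from lxml import html

page = (
    '<html><body><a href="/example/1">'
    '<img id="img" src="http://example.net/example.jpg" alt="Example" />'
    '</a></body></html>'
)

tree = html.fromstring(page)
img = tree.get_element_by_id("img")

# Attributes are read with .get() (or img.attrib["src"]), not img["src"].
src = img.get("src")
print(src)
```

With the URL in hand, the download step from the question would become requests.get(src).content, ideally written inside a `with open(...)` block so the file is closed.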

Improve speed parsing XML with elements and namespace, into Pandas

馋奶兔 posted on 2021-02-08 07:39:23
Question: So I have a 52M XML file, which consists of 115139 elements.

    from lxml import etree
    tree = etree.parse(file)
    root = tree.getroot()

    In [76]: len(root)
    Out[76]: 115139

I have this function that iterates over the elements within root and inserts each parsed element inside a Pandas DataFrame.

    def fnc_parse_xml(file, columns):
        start = datetime.datetime.now()
        df = pd.DataFrame(columns=columns)
        tree = etree.parse(file)
        root = tree.getroot()
        xmlns = './/{' + root.nsmap[None] + '}'
        for loc,e in
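
The function is cut off, but it appears to fill the DataFrame one row at a time, which is generally slow for this many rows. A common alternative is to collect plain dicts in a list and build the DataFrame once at the end. The sketch below shows only the collection step with lxml; the sample document, namespace URI and column names are made up.

```python
from lxml import etree

# Hypothetical small document with a default namespace, as in the question.
xml = (
    b'<root xmlns="http://example.com/ns">'
    b'<rec><a>1</a><b>x</b></rec>'
    b'<rec><a>2</a><b>y</b></rec>'
    b'</root>'
)
columns = ["a", "b"]

root = etree.fromstring(xml)
# Clark-notation prefix for the default namespace: "{uri}".
ns = "{" + root.nsmap[None] + "}"

# One dict per child element; findtext returns None for missing columns.
rows = [{col: e.findtext(ns + col) for col in columns} for e in root]
print(rows)
```

The DataFrame can then be created in a single call, for example pd.DataFrame(rows, columns=columns), instead of being grown inside the loop.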

How should I process XLink references with lxml in python?

女生的网名这么多〃 posted on 2021-02-08 04:58:26
Question: I've been asked to write some scripts that read in XML configuration files that make liberal use of XLink to include XML stored in multiple files. For example: <Environment xlink:href="#{common.environment}" /> (#{common.environment} is a property placeholder that gets resolved first and can be ignored here.) The company has standardized on lxml for advanced XML processing in Python. I've been looking for examples or docs on how to process these occurrences under these constraints and, at a
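
As a starting point, the elements that carry an xlink:href attribute can be located with an XPath query and an explicit prefix mapping; what to do with each reference (loading the target file, splicing it in) is left to the caller. This is a minimal sketch; the sample document and the `common.xml#environment` target are hypothetical.

```python
from lxml import etree

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical configuration file with one XLink reference.
xml = (
    b'<Config xmlns:xlink="http://www.w3.org/1999/xlink">'
    b'<Environment xlink:href="common.xml#environment"/>'
    b'<Other/>'
    b'</Config>'
)
root = etree.fromstring(xml)

# Find every element with an xlink:href attribute.
linked = root.xpath("//*[@xlink:href]", namespaces={"xlink": XLINK_NS})

# Namespaced attributes are read using Clark notation: {uri}name.
hrefs = [el.get("{%s}href" % XLINK_NS) for el in linked]
print(hrefs)
```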

Selenium / lxml : Get xpath

六眼飞鱼酱① posted on 2021-02-08 03:44:32
Question: Is there a get_xpath method, or a way to accomplish something similar, in selenium or lxml.html? I have a feeling that I have seen one somewhere but can't find anything like that in the docs. Pseudocode to illustrate:

    browser.find_element_by_name('search[1]').get_xpath()
    >>> '//*[@id="langsAndSearch"]/div[1]/form/input[1]'

Answer 1: As there is no unique mapping between an element and an XPath expression, a general solution is not possible. But if you know something about your XML/HTML, it might be easy
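
On the lxml side, the element tree object has a getpath() method that returns one structural XPath for a given element. It is purely positional (no id-based shortcuts like the pseudocode above), so it is just one of the many possible expressions the answer alludes to. A small sketch with made-up markup:

```python
from lxml import html

# Hypothetical page; the element names are illustrative only.
page = (
    '<html><body><div id="langsAndSearch"><form>'
    '<input name="q"/><input name="lang"/>'
    '</form></div></body></html>'
)
root = html.fromstring(page)
element = root.xpath('//input[@name="lang"]')[0]

# getpath() lives on the ElementTree, not on the element itself.
path = root.getroottree().getpath(element)
print(path)

# Round trip: evaluating the path yields the same element again.
same = root.xpath(path)[0]
```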