lxml | 易学教程

Registering namespaces with lxml before parsing

Question: I am using lxml to parse XML from an external service that uses namespace prefixes but doesn't declare them with xmlns. I am trying to register the namespace by hand with register_namespace, but that doesn't seem to work.

    from lxml import etree

    xml = """
    <Foo xsi:type="xsd:string">bar</Foo>
    """

    etree.register_namespace('xsi', 'http://www.w3.org/2001/XMLSchema-instance')
    el = etree.fromstring(xml)
    # lxml.etree.XMLSyntaxError: Namespace prefix xsi for type on Foo is not defined

What am I missing? Oddly enough,
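A note on why the call above has no effect on parsing: register_namespace only tells lxml which prefix to prefer when it serializes elements in that namespace; it does not make an undeclared prefix known to the parser. One possible workaround, sketched here, is to wrap the fragment in an element that declares the prefix before parsing. The wrapper element name is an assumption made for illustration, not something the external service provides.

```python
from lxml import etree

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# The fragment as received: it uses the xsi prefix without declaring it.
fragment = '<Foo xsi:type="xsd:string">bar</Foo>'

# Hypothetical wrapper element that supplies the missing xmlns:xsi declaration.
wrapped = '<wrapper xmlns:xsi="{}">{}</wrapper>'.format(XSI_NS, fragment)

wrapper = etree.fromstring(wrapped)
foo = wrapper[0]  # the original <Foo> element

# Namespaced attributes are addressed with the {uri}name form.
xsi_type = foo.get("{%s}type" % XSI_NS)
print(foo.tag, foo.text, xsi_type)
```

The element keeps a reference to the wrapper as its parent, so this suits read-only processing best; whether it fits depends on how the fragment is used afterwards.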

Really weird… can't set attributes of built-in/extension type 'lxml.etree._Element'

Question: I've changed attributes for other classes before without issues. _Element is obviously not a built-in.

    from lxml.etree import _Element
    _Element.new_attr = 54

results in:

    TypeError: can't set attributes of built-in/extension type 'lxml.etree._Element'

Answer 1: _Element is implemented in Cython. As Steve Holden explains (my emphasis): The problem is that extension types' attributes are determined by the layout of the object's slots and forever fixed in the C code that implements them: the slots can
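Since the extension type itself cannot be patched, one alternative is lxml's custom element class mechanism: subclass etree.ElementBase in Python and tell a parser to use that subclass. The sketch below assumes this fits the use case; the class name AnnotatedElement is hypothetical.

```python
from lxml import etree

# Patching the Cython-implemented type fails, as described above.
try:
    etree._Element.new_attr = 54
    patched = True
except TypeError:
    patched = False


class AnnotatedElement(etree.ElementBase):
    """Hypothetical Python-level subclass; class attributes are allowed here."""
    new_attr = 54


# Ask a parser to create AnnotatedElement instances for all elements.
lookup = etree.ElementDefaultClassLookup(element=AnnotatedElement)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring("<root><child/></root>", parser)
print(patched, type(root).__name__, root.new_attr)
```

Only elements produced by this particular parser use the subclass; elements created elsewhere remain plain _Element instances.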

Finding the line number of the element's ending tag in lxml

Question: While parsing an XML document with lxml, I want to find the starting and ending line numbers of a particular tag. I can find the starting tag's position with the sourceline property of an lxml.etree element, but I am struggling to find the closing tag's line number. A trivial example of my attempt:

    import lxml.etree as ET
    xml_sample = b'''<?xml version="1.0" encoding="utf-8"?> <collection> <item> <value>foo</value> </item> <item> <value> bar </value> </item> </collection>'''

How to use lxml and python to pretty print a subtree of an xml file?

Question: I have the following code using Python with lxml to pretty print the file example.xml:

    python -c '
    from lxml import etree; from sys import stdout, stdin;
    parser=etree.XMLParser(remove_blank_text=True, strip_cdata=False);
    tree=etree.parse(stdin, parser)
    tree.write(stdout, pretty_print = True)' < example.xml

I'm using lxml because it is important that I preserve the fidelity of the original file, including preserving the CDATA idioms. Here's the file example.xml that I'm using it on: <projects

Python download image with lxml

Question: I need to find an image in HTML code similar to this:

    ...
    <a href="/example/1">
    <img id="img" src="http://example.net/example.jpg" alt="Example" />
    </a>
    ...

I am using lxml and requests. Here is the code:

    import lxml
    from lxml import html
    import requests

    url = 'http://www.example.com'
    r = requests.get(url)
    tree = lxml.html.fromstring(r.content)
    img = tree.get_element_by_id("img")
    f = open("image.jpg",'wb')
    f.write(requests.get(img['src']).content)

But I am getting an error: Traceback
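The traceback is cut off in this excerpt, so the actual error is not known. One likely culprit is img['src']: lxml elements index their children by position, and attributes are read with .get() or .attrib instead. The sketch below shows that difference on an inline HTML string, leaving out the network calls; the page content is a stand-in.

```python
from lxml import html

# Stand-in for the downloaded page.
page = """<html><body>
<a href="/example/1">
  <img id="img" src="http://example.net/example.jpg" alt="Example" />
</a>
</body></html>"""

tree = html.fromstring(page)
img = tree.get_element_by_id("img")

# Subscripting with a string is not attribute access on lxml elements.
try:
    img["src"]
    subscript_ok = True
except TypeError:
    subscript_ok = False

# Attributes are read with .get() (or img.attrib["src"]).
src = img.get("src")
print(subscript_ok, src)
```

With the URL in hand, the download step from the question would become requests.get(src).content, ideally written inside a with open(...) block so the file is closed.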

Improve speed parsing XML with elements and namespace, into Pandas

Question: I have a 52M XML file that consists of 115139 elements.

    from lxml import etree
    tree = etree.parse(file)
    root = tree.getroot()

    In [76]: len(root)
    Out[76]: 115139

I have this function that iterates over the elements within root and inserts each parsed element into a Pandas DataFrame.

    def fnc_parse_xml(file, columns):
        start = datetime.datetime.now()
        df = pd.DataFrame(columns=columns)
        tree = etree.parse(file)
        root = tree.getroot()
        xmlns = './/{' + root.nsmap[None] + '}'
        for loc,e in

How should I process XLink references with lxml in python?

Question: I've been asked to write some scripts that read in XML configuration files that make liberal use of XLink to include XML stored in multiple files. For example: <Environment xlink:href="#{common.environment}" /> (#{common.environment} is a property placeholder that gets resolved first and can be ignored here.) The company has standardized on lxml for advanced XML processing in Python. I've been looking for examples or docs on how to process these occurrences under these constraints and, at a
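As far as I know, lxml does not resolve XLink references on its own (its built-in inclusion support is for XInclude), so a first step might be to collect the xlink:href values and load the targets yourself. The sketch below only does the collection part; the document structure and the file name in the href are assumptions for illustration.

```python
from lxml import etree

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical configuration file with the placeholder already resolved.
config = (
    b'<Config xmlns:xlink="http://www.w3.org/1999/xlink">'
    b'<Environment xlink:href="common/environment.xml"/>'
    b'<Logging level="info"/>'
    b"</Config>"
)
root = etree.fromstring(config)

# Find every element that carries an xlink:href attribute.
linked = root.xpath("//*[@xlink:href]", namespaces={"xlink": XLINK_NS})

# Pair each element's tag with the target it points to.
targets = [(el.tag, el.get("{%s}href" % XLINK_NS)) for el in linked]
print(targets)
```

Each target could then be parsed with etree.parse and spliced into the tree in place of the referencing element, depending on the inclusion semantics the configuration format expects.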

Selenium / lxml : Get xpath

Question: Is there a get_xpath method, or a way to accomplish something similar, in selenium or lxml.html? I have a feeling that I have seen one somewhere, but I can't find anything like that in the docs. Pseudocode to illustrate:

    browser.find_element_by_name('search[1]').get_xpath()
    >>> '//*[@id="langsAndSearch"]/div[1]/form/input[1]'

Answer 1: As there is no unique mapping between an element and an XPath expression, a general solution is not possible. But if you know something about your xml/html, it might be easy
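On the lxml side, the element tree object has a getpath method that returns one structural XPath for a given element. It is a positional path from the root, not the id-based form shown in the pseudocode, so it is only one of the many possible expressions the answer alludes to. A minimal sketch, with a made-up page:

```python
from lxml import html

# Made-up page loosely modelled on the pseudocode above.
page = """<html><body>
<div id="langsAndSearch"><div><form>
<input name="q"/><input name="search[1]"/>
</form></div></div>
</body></html>"""

root = html.fromstring(page)
tree = root.getroottree()

# Locate the element by name, then ask the tree for a path to it.
element = root.xpath('//input[@name="search[1]"]')[0]
path = tree.getpath(element)
print(path)

# The returned path selects exactly that element again.
found = tree.xpath(path)
```

Because the path encodes positions, it can stop matching if the page structure changes between requests.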