lxml | 易学教程

extract href values containing keyword using XPath in python


Question: I know variants of this question have been asked a number of times, but I've not been able to crack it and get what I want. I have a website which has a few tables in it. The table of interest contains a column where each row contains the word Text hyperlinked to a different page. Here is a specific example from the first row on the above linked page: <a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a> This is the general pattern: <a href="_something_something-blah-blah.txt">Text</a> Right now
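One possible way to pull out such links is an XPath `contains()` test on the `href` attribute. The sketch below assumes the keyword is the `.txt` suffix and uses a small hypothetical table built from the pattern quoted in the question; the real page layout is not shown in the excerpt.

```python
from lxml import html

# Hypothetical markup modelled on the link pattern quoted in the question.
page = """
<html><body><table>
  <tr><td>1</td><td><a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a></td></tr>
  <tr><td>2</td><td><a href="_something_something-blah-blah.txt">Text</a></td></tr>
  <tr><td>3</td><td><a href="other.html">Details</a></td></tr>
</table></body></html>
"""

doc = html.fromstring(page)

# href values of every <a> whose href contains the keyword ".txt".
txt_links = doc.xpath('//a[contains(@href, ".txt")]/@href')

# Alternative: select by the visible link text instead of the href.
by_text = doc.xpath('//a[text()="Text"]/@href')

print(txt_links)
```

Either expression returns plain strings, so the results can be joined with the page's base URL before fetching.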

lxml encoding error when parsing utf8 xml


Question: I'm trying to iterate through an XML file (UTF-8 encoded, starts with an XML declaration) with lxml, but get the following error on the character 丂:

    UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this are printed out correctly. The code is:

    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse("filename.xml", parser)
    root = tree.getroot()
    for elem in root:
        print elem[0].text

Does the error mean that it didn't parse
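The traceback names the cp932 codec, which suggests the failure happens while printing to a cp932 console rather than while parsing. The sketch below (Python 3, with a hypothetical one-element document) checks both steps separately: the parse succeeds, and only the conversion to cp932 fails.

```python
from lxml import etree

# Hypothetical UTF-8 document containing the character from the question.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<root><row><c>\u4e02</c></row></root>"
).encode("utf-8")

root = etree.fromstring(xml_bytes)
text = root[0][0].text  # parsing worked: this is the character U+4E02

# Encoding for a cp932 console is the step that fails.
try:
    text.encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ascii() escapes non-ASCII characters, so this prints on any console.
print(ascii(text), "encodable in cp932:", encodable)
```

If this reading is right, writing the text to a UTF-8 file or switching the console encoding would avoid the error without touching the parser.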

create a pandas dataframe from a nested xml file


Question: Here is a small section of an XML file. I would like to create a database from this, with each tag as a unique column name and no duplicated data. I tried using lxml, and the best I have been able to do so far is to create a dataframe that results in something like this: " SRCSGT DATE 11112017 AGENCY Department of Veterans Affairs OFFICE Canandaigua VAMC LOCATION Department of Veterans Affairs Medical Center ZIP 14424 etc, etc, " The XML: <?xml version="1.0" encoding="UTF-8"?> <NOTICES> <SRCSGT>
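A minimal sketch of one way to get one row per record, assuming each `SRCSGT` element holds flat child tags as in the quoted output. The document below is hypothetical: the first record reuses the values shown in the question, and the second is invented to show a record with missing tags.

```python
from lxml import etree

# Hypothetical document; the real file is truncated in the excerpt.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<NOTICES>
  <SRCSGT>
    <DATE>11112017</DATE>
    <AGENCY>Department of Veterans Affairs</AGENCY>
    <OFFICE>Canandaigua VAMC</OFFICE>
    <LOCATION>Department of Veterans Affairs Medical Center</LOCATION>
    <ZIP>14424</ZIP>
  </SRCSGT>
  <SRCSGT>
    <DATE>11122017</DATE>
    <AGENCY>Example Agency</AGENCY>
    <ZIP>00000</ZIP>
  </SRCSGT>
</NOTICES>"""

root = etree.fromstring(xml_bytes)

# One dict per record: tag name -> stripped text.
rows = []
for notice in root:
    rows.append({child.tag: (child.text or "").strip() for child in notice})

# Union of all tag names, usable as column names.
columns = sorted({key for row in rows for key in row})
print(columns)
```

Passing `rows` to `pandas.DataFrame(rows)` then gives one column per tag, with missing values where a record lacks that tag.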

lxml and xml namespaces - Using find and findall to get XML Tag Value


Question: I had issues getting the text values of nodes using lxml when the XML text has namespaces in it. I was using findall('Status'), but the result was always empty. I arrived at the following working code in the end. Is this the correct way of using lxml for fetching node values? Can I improve this further?

    import lxml
    xml_string='<?xml version="1.0" encoding="UTF-8"?> <SCPP:Response xmlns:SCPP="http://www.SCPP.com/XMLSchema"> <SCPP:RESP_BODY> <Seed>001335834994</Seed> </SCPP
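A plain `findall('Status')` only looks at direct children of the element it is called on, which may explain the empty result. The sketch below assumes a hypothetical completion of the truncated document, with an un-namespaced `Status` element next to `Seed`, and shows a prefix map passed through the `namespaces` argument.

```python
from lxml import etree

# Hypothetical completion of the truncated XML; Status and its value are assumed.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<SCPP:Response xmlns:SCPP="http://www.SCPP.com/XMLSchema">
  <SCPP:RESP_BODY>
    <Seed>001335834994</Seed>
    <Status>OK</Status>
  </SCPP:RESP_BODY>
</SCPP:Response>"""

root = etree.fromstring(xml_bytes)
ns = {"SCPP": "http://www.SCPP.com/XMLSchema"}

# Only direct children of the root are searched, so nothing matches.
direct = root.findall("Status")

# Spell out the path, resolving the SCPP prefix through the map.
seed = root.findtext("SCPP:RESP_BODY/Seed", namespaces=ns)

# Or search all descendants for the un-namespaced tag.
status = root.findtext(".//Status")

print(direct, seed, status)
```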

How to remove empty XML tags, containing whitespace only, in XML?


Question: I need to remove cases like this: <text> </text> I have code that works when there is no whitespace, but what about when there is whitespace? Code:

    doc = etree.XML("""<root><a>1</a><b><c></c></b><d></d></root>""")

    def remove_empty_elements(doc):
        for element in doc.xpath('//*[not(node())]'):
            element.getparent().remove(element)

I also need to do it with lxml and not BeautifulSoup.

Answer 1: This XPath, //*[not(*)][not(normalize-space())] will select all leaf elements with only whitespace content. For
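The XPath from the answer can be dropped into the question's function roughly as follows. This is a sketch on a hypothetical variant of the question's document with whitespace added; the function name `remove_blank_leaves` is illustrative.

```python
from lxml import etree

# Variant of the question's document with whitespace-only leaves added.
doc = etree.XML("<root><a>1</a><b><c> </c></b><d>\n</d><text> </text></root>")


def remove_blank_leaves(tree):
    """Remove leaf elements whose content is empty or whitespace only."""
    for element in tree.xpath("//*[not(*)][not(normalize-space())]"):
        parent = element.getparent()
        if parent is not None:  # never try to remove the root itself
            parent.remove(element)


remove_blank_leaves(doc)
result = etree.tostring(doc)
print(result)
```

The node list is computed before anything is removed, so `<b>` survives this pass even though it ends up empty; calling the function again would remove it. Note also that removing an element in lxml drops its tail text with it, which matters for mixed content.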

How do I wrap the contents of a SubElement in an XML tag in Python 3?


Question: I have a sample XML file like this: <root> She <opt>went</opt> <opt>didn't go</opt> to school. </root> I want to create a subelement of <root> named <sentence>, and put all the contents of <root> into it. That is, <root> <sentence> She <opt>went</opt> <opt>didn't go</opt> to school. </sentence> </root> I know how to make a subelement with ElementTree or lxml, but I have no idea how to select everything from "She" to "school." all at once.

    import lxml.etree as ET
    ET.SubElement(root, 'sentence')

I'm lost...

Answer 1: You could go
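Since the answer is cut off in the excerpt, the following is only one possible approach, not necessarily the one the answer goes on to describe. It relies on lxml moving an element, together with its tail text, when that element is appended to a new parent.

```python
import lxml.etree as ET

root = ET.fromstring(
    "<root> She <opt>went</opt> <opt>didn't go</opt> to school. </root>"
)

# Build the wrapper and hand it the leading text of <root>.
sentence = ET.Element("sentence")
sentence.text = root.text
root.text = None

# Appending moves each child (and its tail text) out of <root>.
for child in list(root):
    sentence.append(child)

# Finally attach the wrapper as the only child of <root>.
root.append(sentence)

result = ET.tostring(root, encoding="unicode")
print(result)
```

Iterating over `list(root)` rather than `root` avoids mutating the child sequence while looping over it.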

lxml incremental XML serialisation repeats namespaces


Question: I am currently serializing some largish XML files in Python with lxml. I want to use the incremental writer for that. My XML format heavily relies on namespaces and attributes. When I run the following code

    from io import BytesIO
    from lxml import etree

    sink = BytesIO()
    nsmap = {
        'test': 'http://test.org',
        'foo': 'http://foo.org',
        'bar': 'http://bar.org',
    }
    with etree.xmlfile(sink) as xf:
        with xf.element("test:testElement", nsmap=nsmap):
            name = etree.QName(nsmap["foo"], "fooElement")
            elem =
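The rest of the question and its answers are not shown, so the following is only a sketch of one pattern that keeps declarations on the root. It assumes children can be written through nested `xf.element()` contexts instead of as standalone `Element` objects, and it uses Clark notation (`{uri}name`) for tags rather than the prefixed form in the question.

```python
from io import BytesIO
from lxml import etree

sink = BytesIO()
nsmap = {
    "test": "http://test.org",
    "foo": "http://foo.org",
    "bar": "http://bar.org",
}

with etree.xmlfile(sink) as xf:
    # Declare every prefix once, on the root element.
    with xf.element("{http://test.org}testElement", nsmap=nsmap):
        for i in range(3):
            # Nested elements reuse the prefix already in scope.
            with xf.element("{http://foo.org}fooElement"):
                xf.write(str(i))

output = sink.getvalue()
print(output.decode("utf-8"))
```

A standalone `Element` carries its own namespace map, which is a plausible reason for declarations being repeated when such elements are passed to `xf.write()`.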
