lxml

extract href values containing keyword using XPath in python

百般思念 submitted on 2021-01-28 10:33:06
Question: I know variants of this question have been asked a number of times, but I haven't been able to crack it and get what I want. I have a website that has a few tables in it. The table of interest contains a column in which each row contains the word Text hyperlinked to a different page. Here is a specific example from the first row on the above linked page: <a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a> This is the general pattern: <a href="_something_something-blah-blah.txt">Text</a> Right now
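
The question is cut off before the asker's own attempt, so the following is only a minimal sketch of one way to pull href values that contain a keyword. The `PAGE` fragment and the `hrefs_containing` helper are hypothetical, modelled on the link pattern quoted above; the XPath `contains()` function and lxml's XPath variables (`$kw`) do the filtering.

```python
from lxml import html

# Hypothetical page fragment modelled on the link pattern in the question.
PAGE = """
<table>
  <tr><td><a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a></td></tr>
  <tr><td><a href="coins.html">Plate</a></td></tr>
  <tr><td><a href="_something_something-blah-blah.txt">Text</a></td></tr>
</table>
"""


def hrefs_containing(markup, keyword):
    """Return the href of every <a> whose href contains `keyword`."""
    tree = html.fromstring(markup)
    # $kw is an XPath variable; lxml fills it from the keyword argument,
    # which avoids building the expression by string concatenation.
    return tree.xpath("//a[contains(@href, $kw)]/@href", kw=keyword)


links = hrefs_containing(PAGE, ".txt")
print(links)
```

To match on the link text as well, a predicate such as `[text()="Text"]` could be added to the same expression.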

lxml encoding error when parsing utf8 xml

假装没事ソ submitted on 2021-01-28 09:19:23
Question: I'm trying to iterate through an XML file (UTF-8 encoded, starts with ) with lxml, but I get the following error on the character 丂: UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence. Other characters before this one are printed correctly. The code is: parser = etree.XMLParser(encoding='utf-8') tree = etree.parse("filename.xml", parser) root = tree.getroot() for elem in root: print elem[0].text Does the error mean that it didn't parse
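
The traceback is a UnicodeEncodeError from the cp932 codec, which suggests (this is an inference, since the question is truncated) that parsing succeeded and the failure happens when `print` encodes the text for a cp932 console. The sketch below, written for Python 3 with a hypothetical in-memory document standing in for filename.xml, separates the two steps.

```python
from io import BytesIO

from lxml import etree

# Hypothetical stand-in for filename.xml: UTF-8 bytes containing U+4E02.
XML_BYTES = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<root><item><c>\u4e02</c></item></root>"
).encode("utf-8")

tree = etree.parse(BytesIO(XML_BYTES))
root = tree.getroot()
text = root[0][0].text  # parsing worked: this is a normal Unicode string

# Encoding for a cp932 console is the step that fails, not the parse.
try:
    text.encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# One possible workaround: escape what the console cannot show.
safe_for_console = text.encode("cp932", errors="backslashreplace").decode("cp932")
print(encodable, safe_for_console)
```

Writing the output to a file opened with `encoding="utf-8"` is another option when the characters themselves must be preserved.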

create a pandas dataframe from a nested xml file

左心房为你撑大大i submitted on 2021-01-28 08:49:04
Question: Here is a small section of an XML file. I would like to create a database from it, with each tag as a unique column name and no duplicated data. I tried using lxml, and the best I have been able to do so far is to create a dataframe that results in something like this: " SRCSGT DATE 11112017 AGENCY Department of Veterans Affairs OFFICE Canandaigua VAMC LOCATION Department of Veterans Affairs Medical Center ZIP 14424 etc, etc, " The XML: <?xml version="1.0" encoding="UTF-8"?> <NOTICES> <SRCSGT>

lxml and xml namespaces - Using find and findall to get XML Tag Value

冷暖自知 submitted on 2021-01-28 05:31:02
Question: I had issues getting the text value of and nodes using lxml where the XML text has namespaces in it. I was using findall('Status'), but the result was always empty. I arrived at the following working code in the end. Is this the correct way of using lxml to fetch node values? Can I improve this further? import lxml xml_string='<?xml version="1.0" encoding="UTF-8"?> <SCPP:Response xmlns:SCPP="http://www.SCPP.com/XMLSchema"> <SCPP:RESP_BODY> <Seed>001335834994</Seed> </SCPP
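
The asker's working code is cut off, so this is only a sketch of how `find`, `findtext` and `findall` behave on a document shaped like the one quoted. The `<Status>` element and its value are hypothetical additions, since the original XML is truncated before that point. Two things matter here: a bare tag name only matches direct children, and prefixed elements need a prefix-to-URI map.

```python
from lxml import etree

xml_string = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<SCPP:Response xmlns:SCPP="http://www.SCPP.com/XMLSchema">'
    b"<SCPP:RESP_BODY>"
    b"<Seed>001335834994</Seed>"
    b"<Status>OK</Status>"  # hypothetical: not visible in the truncated question
    b"</SCPP:RESP_BODY>"
    b"</SCPP:Response>"
)

root = etree.fromstring(xml_string)
ns = {"SCPP": "http://www.SCPP.com/XMLSchema"}

# A bare name only looks at direct children of root, so this finds nothing.
direct = root.findall("Status")

# Prefixed elements are resolved through the namespaces mapping.
body = root.find("SCPP:RESP_BODY", namespaces=ns)

# Seed and Status carry no prefix and there is no default namespace,
# so they are matched by their plain names.
seed = body.findtext("Seed")
status = root.findtext(".//Status")  # .// searches all descendants
print(direct, seed, status)
```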

How to remove empty XML tags, containing whitespace only, in XML?

心已入冬 submitted on 2021-01-27 12:42:04
Question: I need to remove cases like this: <text> </text> I have code that works when there is no whitespace, but what if there is whitespace? Code: doc = etree.XML("""<root><a>1</a><b><c></c></b><d></d></root>""") def remove_empty_elements(doc): for element in doc.xpath('//*[not(node())]'): element.getparent().remove(element) I also need to do it with lxml and not BeautifulSoup. Answer 1: This XPath, //*[not(*)][not(normalize-space())] will select all leaf elements with only whitespace content. For
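
As a rough illustration, the XPath from the answer can be dropped into the asker's removal loop. The function name `remove_blank_leaves` and the sample document (the asker's, extended with whitespace-only elements) are assumptions made for this sketch.

```python
from lxml import etree


def remove_blank_leaves(root):
    """Remove leaf elements that are empty or hold only whitespace.

    Returns the number of elements removed in this pass.
    """
    removed = 0
    for element in root.xpath("//*[not(*)][not(normalize-space())]"):
        parent = element.getparent()
        if parent is None:
            continue  # never remove the document root itself
        # Note: in lxml, remove() also drops the element's tail text.
        parent.remove(element)
        removed += 1
    return removed


doc = etree.XML("<root><a>1</a><b><c> </c></b><d></d><text> </text></root>")
remove_blank_leaves(doc)
result = etree.tostring(doc)
print(result)
```

A single pass leaves `<b/>` behind, because `<b>` only becomes an empty leaf after `<c>` is gone; calling the function again, or looping until it returns 0, removes such newly emptied parents.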

How do I wrap the contents of a SubElement in an XML tag in Python 3?

旧街凉风 submitted on 2021-01-27 07:57:08
Question: I have a sample XML file like this: <root> She <opt>went</opt> <opt>didn't go</opt> to school. </root> I want to create a subelement named <sentence> of <root>, and put all the contents of <root> into it. That is, <root> <sentence> She <opt>went</opt> <opt>didn't go</opt> to school. </sentence> </root> I know how to make a subelement with ElementTree or lxml, but I have no idea how to select everything from "She" to "school." all at once. import lxml.etree as ET ET.SubElement(root, 'sentence') I'm lost... Answer 1: You could go
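
The answer is truncated, so the following is not necessarily its approach; it is a minimal sketch of one way to do the wrapping with lxml. It relies on two lxml behaviours: the text before the first child lives in `root.text`, and appending an element that already has a parent moves it together with its tail text.

```python
import lxml.etree as ET

root = ET.fromstring(
    "<root> She <opt>went</opt> <opt>didn't go</opt> to school. </root>"
)

sentence = ET.Element("sentence")

# The text before the first child belongs to the parent's .text.
sentence.text = root.text
root.text = None

# Snapshot the children first, because append() moves each one out of root.
# The text after each <opt> is its .tail and travels with the element.
for child in list(root):
    sentence.append(child)

root.append(sentence)
result = ET.tostring(root, encoding="unicode")
print(result)
```

The standard library's xml.etree.ElementTree does not move elements on append, so the same loop there would also need to remove each child from `root`.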

lxml incremental XML serialisation repeats namespaces

故事扮演 submitted on 2021-01-27 07:41:59
Question: I am currently serializing some largish XML files in Python with lxml. I want to use the incremental writer for that. My XML format relies heavily on namespaces and attributes. When I run the following code: from io import BytesIO from lxml import etree sink = BytesIO() nsmap = { 'test': 'http://test.org', 'foo': 'http://foo.org', 'bar': 'http://bar.org', } with etree.xmlfile(sink) as xf: with xf.element("test:testElement", nsmap=nsmap): name = etree.QName(nsmap["foo"], "fooElement") elem =
