lxml

Python lxml library fails to parse < and >

妖精的绣舞 submitted on 2019-12-22 12:50:47
Question: I have an XSLT with JavaScript in it which uses "&lt;" and "&gt;" inside a for loop: <?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <html> <head> </head> <body> <script language="javascript" type="text/javascript"> function example() { var trs = document.getElementsByTagName("tr"); for (var i = 0; i &lt; trs.length; i++) { } } </script> </body> </html> I am using the Python lxml library to generate HTML

How to read an html table with multiple tbodies with python pandas' read_html?

末鹿安然 submitted on 2019-12-22 12:18:26
Question: This is my html: import pandas as pd html_table = '''<table> <thead> <tr><th>Col1</th><th>Col2</th> </thead> <tbody> <tr><td>1a</td><td>2a</td></tr> </tbody> <tbody> <tr><td>1b</td><td>2b</td></tr> </tbody> </table>''' If I run df = pd.read_html(html_table) and then print(df[0]), I get: Col1 Col2 0 1a 2a The second row (1b, 2b) disappears. Why? How can I prevent it? Answer 1: The HTML you have posted is not valid. Multiple tbody elements are what confuse the pandas parser logic. If you cannot fix the input html itself
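
The answer above is cut off, so the following is only one possible workaround and not necessarily what it goes on to propose: a minimal sketch that merges every extra tbody into the first one before the table is handed to pandas. The helper name merge_tbodies is hypothetical, and the sample table adds the closing tr that the thead in the question is missing.

```python
import lxml.html

html_table = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th></tr>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''


def merge_tbodies(html):
    """Hypothetical helper: move the rows of every extra <tbody> into the first one."""
    table = lxml.html.fromstring(html)
    bodies = table.findall("tbody")
    for extra in bodies[1:]:
        for row in extra.findall("tr"):
            bodies[0].append(row)  # append() moves the row out of its old parent
        table.remove(extra)        # drop the now-empty extra <tbody>
    return lxml.html.tostring(table, encoding="unicode")


merged = merge_tbodies(html_table)
```

The merged string can then be passed to pd.read_html in place of the original markup (recent pandas versions expect it wrapped in io.StringIO).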

Figuring out where CDATA is in lxml element?

自古美人都是妖i submitted on 2019-12-22 11:18:12
Question: I need to parse and rebuild a file format used by a parser which speaks a language that can only charitably be described as XML. I realize that standards-compliant XML doesn't care about either the CDATA or the whitespace, but unfortunately this application demands that I care about both... I'm using lxml.etree because it's pretty good at preserving CDATA. For example: s = ''' <root> <item> <![CDATA[whatever]]> </item> </root>''' import lxml.etree as et et.fromstring(s, et.XMLParser(strip

python lxml using iterparse to edit and output xml

爱⌒轻易说出口 submitted on 2019-12-22 11:09:22
Question: I've been messing around with the lxml library for a little while, and maybe I'm not understanding it correctly or I'm missing something, but I can't figure out how to edit the file after I catch a certain xpath and then write that back out as XML while I'm parsing element by element. Say we have this xml as an example: <xml> <items> <pie>cherry</pie> <pie>apple</pie> <pie>chocolate</pie> </items> </xml> What I would like to do while parsing is when I hit that xpath of "
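
Since the question is cut off before it says what the edit should be, here is a minimal sketch under an assumed goal: upper-casing the text of every pie element as iterparse reports it, then serializing the finished tree. The upper-casing is purely hypothetical. Note that this keeps the whole tree in memory, so it illustrates editing during parsing rather than a fully streaming rewrite.

```python
from io import BytesIO

from lxml import etree

xml = b"<xml><items><pie>cherry</pie><pie>apple</pie><pie>chocolate</pie></items></xml>"

# Only "end" events for <pie> elements are reported, so each element is complete.
context = etree.iterparse(BytesIO(xml), events=("end",), tag="pie")
for _event, elem in context:
    elem.text = elem.text.upper()  # hypothetical edit; the real one is not given

# After the loop has consumed the input, the parsed tree hangs off context.root.
result = etree.tostring(context.root)
```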

Static python method to XML escape string which supports quotes

▼魔方 西西 submitted on 2019-12-22 10:45:24
Question: I have a string that has both XML-escaped and non-escaped characters, and I need it to be 100% valid XML. Example: >>> s = '&lt; <' I want this to be: >>> s = '&lt; &lt;' I have tried numerous methods with lxml, cgi, etc., but they all expect the input string not to contain any already-escaped XML characters: >>> import cgi >>> cgi.escape("&lt; <") '&amp;lt; &lt;' or >>> from xml.sax.saxutils import escape >>> escape("&lt; <") '&amp;lt; &lt;' Isn't there a standard method for this already? Someone has to have had the same
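
One way to approach this with the standard library alone, offered as a sketch rather than a definitive answer, is to unescape first and then escape, so that entities already present are not escaped a second time. The function name normalize_escapes and the quote mappings are assumptions made for illustration. The approach cannot tell an intentionally literal "&lt;" apart from an escaped "<".

```python
from xml.sax.saxutils import escape, unescape

# Extra mappings so that single and double quotes are handled too (assumed requirement).
_QUOTES = {'"': "&quot;", "'": "&apos;"}
_UNQUOTES = {entity: char for char, entity in _QUOTES.items()}


def normalize_escapes(s):
    """Hypothetical helper: decode any existing entities, then escape everything once."""
    return escape(unescape(s, _UNQUOTES), _QUOTES)


mixed = normalize_escapes("&lt; <")
quoted = normalize_escapes('"a" &amp; b')
```

Because the function decodes before encoding, running it on its own output leaves the string unchanged.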

How can I prevent lxml from auto-closing empty elements when serializing to string?

故事扮演 submitted on 2019-12-22 10:39:09
Question: I am parsing a huge xml file which contains many empty elements such as <MemoryEnv></MemoryEnv> When serializing with etree.tostring(root_element, pretty_print=True) the output element is collapsed to <MemoryEnv/> Is there any way to prevent this? etree.tostring() does not provide such a facility. Is there a way to interfere with lxml's tostring() serializer? Btw, the html module does not work. It's not designed for XML, and it does not create empty elements in their original form. The
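
A commonly suggested workaround, sketched here on a tiny stand-in document rather than the poster's file, is to give every childless element an empty string as its text before serializing; lxml then writes an explicit start and end tag instead of the collapsed form.

```python
from lxml import etree

root = etree.fromstring("<root><MemoryEnv></MemoryEnv><Other>x</Other></root>")

# iter("*") visits elements only, skipping comments and processing instructions.
for elem in root.iter("*"):
    if len(elem) == 0 and elem.text is None:
        elem.text = ""  # an empty text node stops the element from collapsing

out = etree.tostring(root)
```

The loop touches every empty element, so on a very large tree it adds one full traversal before serialization.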

Using python-amazon-product-api on Google Appengine without lxml [duplicate]

丶灬走出姿态 submitted on 2019-12-22 08:33:57
Question: This question already has answers here: Closed 7 years ago. Possible Duplicate: Amazon API library for Python? I want to use the python-amazon-product-api wrapper to access the Amazon API: http://pypi.python.org/pypi/python-amazon-product-api/ Unfortunately it relies on lxml, which is not supported on Google Appengine. Does anyone know a workaround? I'm only looking to do basic stuff with the API, so could I use ElementTree instead? I'm a newbie so using anything other than how it comes

Fully streaming XML parser

不问归期 submitted on 2019-12-22 08:19:39
Question: I'm trying to consume the Exchange GetAttachment webservice using requests, lxml and base64io. This service returns a base64-encoded file in a SOAP XML HTTP response. The file content is contained in a single line in a single XML element. GetAttachment is just an example, but the problem is more general. I would like to stream the decoded file contents directly to disk without storing the entire contents of the attachment in memory at any point, since an attachment could be several 100 MB. I

replacing node text using lxml.objectify while preserving attributes

主宰稳场 submitted on 2019-12-22 08:15:43
Question: Using lxml.objectify like so: from lxml import objectify o = objectify.fromstring("<a><b atr='someatr'>oldtext</b></a>") o.b = 'newtext' results in <a><b>newtext</b></a>, losing the node attribute. It seems to be directly replacing the element with a newly created one, rather than simply replacing the text of the element. If I try to use o.b.text = 'newtext', it tells me that attribute 'text' of 'StringElement' objects is not writable. Is there a way to do this within objectify without
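
One option that stays within objectify, shown as a sketch and not as the answer the original thread settled on, is the _setText() method that objectify data elements provide. It changes the text of the existing element in place, so its attributes are left alone. lxml documents this method as intended mainly for subclasses, so treat it as a semi-private API.

```python
from lxml import etree, objectify

o = objectify.fromstring("<a><b atr='someatr'>oldtext</b></a>")

# Update the text of the existing <b> element instead of assigning a new one.
o.b._setText("newtext")

result = etree.tostring(o)
```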

Is there a way to disable urlencoding of anchor attributes in lxml

匆匆过客 submitted on 2019-12-22 07:38:11
Question: I am using lxml 2.2.8 and trying to transform some existing html files into Django templates. The only problem I am having is that lxml urlencodes the anchor name and href attributes. For example: <xsl:template match="a"> <!-- anchor attribute href is urlencoded but the title is escaped --> <a href="{{{{item.get_absolute_url}}}}" title="{{{{item.title}}}}"> <!-- name tag is urlencoded --> <xsl:attribute name="name">{{item.name}}</xsl:attribute> <!-- but other attributes are not --> <xsl