lxml

Why is the slash at the end of lxml.html.parse() important?

孤人 submitted on 2019-12-11 04:43:30
Question: I am using lxml to scrape HTML. This code works:

    lxml.html.parse("http://google.com/")

This code does not:

    lxml.html.parse("http://google.com")

Why does the slash at the end of the URL matter? Thank you. To be clear, here is the error log that Python gives me for the latter code:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/davidfaux/epd-7.2-2-rh5-x86/lib/python2.7/site-packages/lxml/html/__init__.py", line 692, in parse
        return etree.parse

building django template files with xslt

时间秒杀一切 submitted on 2019-12-11 04:12:01
Question: I have about 4,000 HTML documents that I am trying to convert into Django templates using XSLT. The problem is that XSLT escapes the '{' curly braces for template variables when I try to include a template variable inside an attribute. My XSLT file looks like this:

    <xsl:template match="p">
      <p>
        <xsl:attribute name="nid"><xsl:value-of select="$node_id"/></xsl:attribute>
        <xsl:apply-templates select="*|node()"/>
      </p>
      <span> {% get_comment_count for thing '<xsl:value

lxml removing <?xml …> tags when parsing?

喜你入骨 submitted on 2019-12-11 04:04:08
Question: I'm currently parsing XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the element <?xml ...>. For example,

    from lxml import etree
    tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
    print etree.tostring(tree)

will result in

    <dmodule>test</dmodule>

Does anyone know why the <?xml ...> element is being removed? I thought
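
For context, the XML declaration is not an element in the parsed tree, so serializing an element leaves it out unless it is requested. A minimal sketch, assuming Python 3 (hence the bytes literal and print() call, unlike the Python 2 snippet in the question) and the xml_declaration argument of etree.tostring. The fallback import to the standard-library ElementTree (Python 3.8+) is an addition for illustration only:

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib API is close enough for this sketch
    import xml.etree.ElementTree as etree

# Bytes input, because lxml rejects str input that carries an encoding declaration.
source = b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
root = etree.fromstring(source)

# Default serialization: the element only, without the declaration.
without_decl = etree.tostring(root)

# Ask for the declaration explicitly when serializing.
with_decl = etree.tostring(root, xml_declaration=True, encoding="utf-8")

print(without_decl)
print(with_decl)
```

The declaration in the output is regenerated from the arguments given to tostring; it is not copied from the input document.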

schematron report issue with python lxml

六月ゝ 毕业季﹏ submitted on 2019-12-11 03:11:26
Question: I'm validating XML documents with the lxml schematron module. It works well, but I can't display the validation report, which is set as a property. I can't find how to process it as an XML tree. Here is a snippet of the code I use:

    xdoc = etree.parse("mydoc.xml")
    # schematron code removed for clarity
    f = StringIO.StringIO('''<schema>...</schema>''')
    sdoc = etree.parse(f)
    schematron = isoschematron.Schematron(sdoc, store_schematron=True, store_xslt=True, store_report=True)
    if schematron.validate

Editing local XML file using Python and Regular expression

陌路散爱 submitted on 2019-12-11 02:32:51
Question: I am new to Python and am trying to modify some XML configuration files on my local system. Input: I have an XML file (say Test.xml) with the following content.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <JavaHost xmlns="SomeInfo/v1.1">
      <Domain>
        <MessageProcessor>
          <!-- This comment should not be removed and all formating should be untouched -->
          <SocketTimeout>500</SocketTimeout>
        </MessageProcessor>
        <!-- This comment should not be removed and all formating should be

How to parse text from a html table element

流过昼夜 submitted on 2019-12-11 01:39:22
Question: I'm currently writing a small test web scraper using the Python requests and lxml libraries. I'm trying to extract the text from the rows of a table from this site, using XPath expressions to uniquely identify the table. Since the table itself can only be identified by its class name, and the class name isn't unique, I had to use the parent div element in order to specify the table. The table in question is the one that lists the dates of the season order, filming, and airdates for the

How to prevent lxml remove method from removing text between two elements

浪尽此生 submitted on 2019-12-11 01:02:30
Question: I'm using lxml and Python 2.7 to parse XML files. At some point I need to use the remove method to remove an element, but, very strangely, it removes some text after it as well. The input XML is:

    <ce:para view="all">Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>

Then I need to expand the cross-refs element into multiple cross-ref elements with separate refid values. So the
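
The behavior described here follows from how lxml stores text: the text after an element's closing tag is held in that element's tail attribute, so removing the element drops the tail with it. Below is a minimal sketch of one way to keep that text, assuming a hypothetical helper named remove_keep_tail. It uses plain tag names, because the namespace declaration for the ce: prefix is not shown in the question. The fallback import to the standard-library ElementTree, which uses the same text/tail model, is for illustration only:

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree shares the text/tail model
    import xml.etree.ElementTree as etree


def remove_keep_tail(parent, elem):
    """Remove elem from parent but keep the text that followed it (hypothetical helper)."""
    tail = elem.tail or ""
    children = list(parent)
    idx = children.index(elem)
    if idx > 0:
        # Hand the tail to the previous sibling.
        prev = children[idx - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        # No previous sibling: the tail joins the parent's leading text.
        parent.text = (parent.text or "") + tail
    parent.remove(elem)


para = etree.fromstring(
    '<para>Web and grid services <cross-refs refid="BIB10 BIB11">[10,11]'
    "</cross-refs>, where they can help.</para>"
)
remove_keep_tail(para, para[0])
result = etree.tostring(para, encoding="unicode")
print(result)
```

When the removed element is to be replaced by new elements, as in the question, the saved tail would instead be assigned to the last replacement element.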

Python XPath lxml could not read SVG path element due to empty namespace?

天涯浪子 submitted on 2019-12-11 00:57:28
Question: I have an SVG (XML) file from which I want to select some elements. For the sake of an MCRE I have cut down the file to this:

    <svg >
      <!-- xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" -->
      <g>
        <path style="fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none" />
        <path style="fill:#a80c3d;fill-opacity:1;fill-rule:nonzero;stroke:none" />
        <path style="fill:#a98b6e;fill-opacity:1;fill-rule:nonzero;stroke:none" />
      </g>
    </svg>

where some optional namespace attributes
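
One common cause of this symptom, offered as an assumption because the question text is cut off, is a declared default namespace. Once xmlns="http://www.w3.org/2000/svg" is declared, every element lives in that namespace and an unprefixed path no longer matches. A minimal sketch: bind a prefix of your choosing (svg below) to the namespace URI and use it in the query. findall() is used so that the illustrative fallback import also works; lxml's xpath() method accepts the same namespaces mapping.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree supports the same findall() call
    import xml.etree.ElementTree as etree

SVG_NS = "http://www.w3.org/2000/svg"

# Cut-down sample with the default namespace declared (styles shortened).
doc = (
    '<svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<path style="fill:#19518b"/>'
    '<path style="fill:#a80c3d"/>'
    '<path style="fill:#a98b6e"/>'
    "</g></svg>"
)
root = etree.fromstring(doc)

# An unprefixed name looks for elements in no namespace and finds nothing.
unqualified = root.findall(".//path")

# The prefix is local to this query; only the URI it maps to matters.
qualified = root.findall(".//svg:path", namespaces={"svg": SVG_NS})

print(len(unqualified), len(qualified))
```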

How to create a Text Node with lxml?

喜你入骨 submitted on 2019-12-10 23:58:28
Question: I'm using lxml and Python to manipulate XML files. I want to create a text node with no tags, preferably, instead of creating a new Element and then appending text to it. How can I do that? I found an equivalent of this in Python's xml.dom.minidom package called createTextNode, so I was wondering whether lxml supports the same functionality.

Answer 1: It looks like lxml doesn't provide a special API to create a text node. You can simply set the text property of a parent element to create or modify text
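
The answer's suggestion can be sketched as follows. Text is attached through an element's text attribute (the text before its first child) or a child's tail attribute (the text after that child's closing tag), not through a separate node object. The element names are made up for illustration, and the fallback import to the standard-library ElementTree, which exposes the same two attributes, is an addition:

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree has the same text/tail attributes
    import xml.etree.ElementTree as etree

para = etree.Element("para")
para.text = "Web and grid services "  # text before the first child

ref = etree.SubElement(para, "ref")
ref.text = "[10]"                     # text inside the child
ref.tail = ", where they help."       # text after the child's closing tag

serialized = etree.tostring(para, encoding="unicode")
print(serialized)
```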

Getting a memory error when parsing a large XML file in Python

大兔子大兔子 submitted on 2019-12-10 23:43:24
Question: My XML file looks like this:

    <root>
      <group from="1", to="100">
        <link target="1"/>
        ...
        <link target="100"/>
      </group>
      ...
    </root>

I have 6,000 <group> elements and 5M <link> elements. I want a dictionary with the tuple (from, to) as keys and a list of the <link> elements' target attributes as values, but I get a memory error with the following code:

    from lxml import etree
    from gzip import open as gopen

    def extractTargets(fin):
        targets = dict()
        with gopen(fin) as xml:
            context = etree.iterparse(xml, tag=
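
A common way to keep memory bounded with iterparse is sketched here under assumptions, since the original code is cut off. Handle each <group> when its end event fires, copy out what is needed, then clear the element so its <link> children can be freed. The function name extract_targets and the in-memory sample are hypothetical. The sample omits the comma between attributes, which an XML parser would reject. Filtering on elem.tag instead of the tag= keyword keeps the sketch compatible with the illustrative fallback import.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib iterparse behaves the same for this sketch
    import xml.etree.ElementTree as etree


def extract_targets(source):
    """Map (from, to) -> list of link targets, releasing each group once read."""
    targets = {}
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "group":
            continue
        key = (elem.get("from"), elem.get("to"))
        targets[key] = [link.get("target") for link in elem.iter("link")]
        # Drop the children and attributes of the finished group.
        elem.clear()
    return targets


sample = io.BytesIO(
    b'<root>'
    b'<group from="1" to="3"><link target="1"/><link target="2"/><link target="3"/></group>'
    b'<group from="4" to="4"><link target="4"/></group>'
    b'</root>'
)
result = extract_targets(sample)
print(result)
```

With a gzip-compressed file, the open file object returned by gzip.open can be passed as source in place of the BytesIO sample.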