lxml | 易学教程

Why is the slash at the end of lxml.html.parse() important?


Question: I am using lxml to scrape HTML. This code works: lxml.html.parse( "http://google.com/" ) This code does not: lxml.html.parse( "http://google.com" ) Why does the slash at the end of the URL matter? Thank you. To be clear, here is the error log that Python gives me for the latter code: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/davidfaux/epd-7.2-2-rh5-x86/lib/python2.7/site-packages/lxml/html/__init__.py", line 692, in parse return etree.parse

building django template files with xslt


Question: I have about 4,000 HTML documents that I am trying to convert into Django templates using XSLT. The problem is that XSLT escapes the '{' curly braces for template variables when I try to include a template variable inside an attribute. My XSLT file looks like this: <xsl:template match="p"> <p> <xsl:attribute name="nid"><xsl:value-of select="$node_id"/></xsl:attribute> <xsl:apply-templates select="*|node()"/> </p> <span> {% get_comment_count for thing '<xsl:value

lxml removing <?xml …> tags when parsing?


Question: I'm currently working on parsing XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the element <?xml ...> . For example: from lxml import etree tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser()) print etree.tostring(tree) will result in <dmodule>test</dmodule> Does anyone know why the <?xml ...> element is being removed? I thought
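A minimal sketch of one way to get the declaration back, assuming Python 3 and a recent lxml. The XML declaration is not a node in the parsed tree, so it is absent from the default output of etree.tostring() and has to be requested when serializing. The variable names below are illustrative, and the input is passed as bytes because lxml rejects a str that carries an encoding declaration.

```python
from lxml import etree

# Parse from bytes: lxml refuses a str that contains an encoding declaration.
root = etree.fromstring(
    b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
)

# Default serialization: no declaration, only the element tree.
without_decl = etree.tostring(root)

# Ask for the declaration explicitly when serializing.
with_decl = etree.tostring(root, xml_declaration=True, encoding="utf-8")

print(without_decl)
print(with_decl)
```

The declaration is regenerated by the serializer, so its quoting and line break may differ from the input file.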

schematron report issue with python lxml


Question: I'm validating XML documents with the lxml schematron module. It works well, but I can't display the validation report, which is set as a property. I can't find how to process it as an XML tree. Here is the snippet of the code I use: xdoc = etree.parse("mydoc.xml") # schematron code removed for clarity f = StringIO.StringIO('''<schema>...</schema>''') sdoc = etree.parse(f) schematron = isoschematron.Schematron(sdoc, store_schematron=True, store_xslt=True, store_report=True) if schematron.validate

Editing local XML file using Python and Regular expression


Question: I am new to Python and am trying to modify some XML configuration files on my local system. Input: I have an XML file (say Test.xml) with the following content. <?xml version="1.0" encoding="UTF-8" standalone="no"?> <JavaHost xmlns="SomeInfo/v1.1"> <Domain> <MessageProcessor>  <SocketTimeout>500</SocketTimeout> </MessageProcessor> <!-- This comment should not be removed and all formating should be

How to parse text from a html table element


Question: I'm currently writing a small test web scraper using the Python requests and lxml libraries. I'm trying to extract the text from the rows of a table on this site, using XPath expressions to uniquely identify the table. Since the table itself can only be identified by its class name, and the class name isn't unique, I had to use the parent div element in order to specify the table. The table in question is the one that lists the dates of the season order, filming, and airdates for the

How to prevent lxml remove method from removing text between two elements


Question: I'm using lxml and Python 2.7 to parse XML files. I need to use the remove method to remove an element at some point, but strangely it removes some of the text after it as well. The input XML is: <ce:para view="all">Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para> Then I need to expand the cross-refs element into multiple cross-ref elements with separate refid values. So the
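In lxml, the text that follows an element's closing tag is stored on that element as its tail, so removing the element removes the tail with it. The sketch below shows one common workaround, assuming Python 3: move the tail onto the previous sibling (or the parent's text) before removing. The helper name remove_keep_tail and the simplified, namespace-free tags are hypothetical and not taken from the question.

```python
from lxml import etree


def remove_keep_tail(element):
    """Remove element from its parent while keeping the text that follows it."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("element has no parent")
    tail = element.tail or ""
    previous = element.getprevious()
    if previous is not None:
        # Attach the tail to the preceding sibling.
        previous.tail = (previous.tail or "") + tail
    else:
        # No preceding sibling: the tail becomes part of the parent's text.
        parent.text = (parent.text or "") + tail
    parent.remove(element)


# Simplified stand-in for the <ce:para> input (namespace prefixes omitted).
root = etree.fromstring(
    "<para>Web and grid services <refs>[10,11]</refs>, where they help.</para>"
)
remove_keep_tail(root.find("refs"))
result = etree.tostring(root, encoding="unicode")
print(result)
```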

Python XPath lxml could not read SVG path element due to empty namespace?


Question: I have an SVG (XML) file from which I want to select some elements. For the sake of a MCRE I have cut down the file to this: <svg >  <g> <path style="fill:#19518b;fill-opacity:1;fill-rule:nonzero;stroke:none" /> <path style="fill:#a80c3d;fill-opacity:1;fill-rule:nonzero;stroke:none" /> <path style="fill:#a98b6e;fill-opacity:1;fill-rule:nonzero;stroke:none" /> </g> </svg> Where some optional namespace attributes
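The excerpt is cut off before the namespace attributes are shown, so the following is only a sketch under an assumption: that the real file declares the usual default SVG namespace (xmlns="http://www.w3.org/2000/svg"). In that case an unprefixed XPath name such as //path matches nothing, because XPath 1.0 treats unprefixed names as belonging to no namespace. The prefix "s" and the shortened style values are illustrative.

```python
from lxml import etree

SVG_NS = "http://www.w3.org/2000/svg"  # assumed default namespace of the file

svg = etree.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<path style="fill:#19518b"/><path style="fill:#a80c3d"/>'
    "</g></svg>"
)

# Unprefixed name: looks for <path> in no namespace, so nothing matches.
unprefixed = svg.xpath("//path")

# Bind a prefix of our choosing to the SVG namespace.
prefixed = svg.xpath("//s:path", namespaces={"s": SVG_NS})

# Alternative that ignores the namespace entirely.
by_local_name = svg.xpath("//*[local-name()='path']")

print(len(unprefixed), len(prefixed), len(by_local_name))
```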

How to create a Text Node with lxml?


Question: I'm using lxml and Python to manipulate XML files. I want to create a text node with no tags, preferably, instead of creating a new Element and then appending text to it. How can I do that? I found an equivalent of this in Python's xml.dom.minidom package, called createTextNode , so I was wondering whether lxml supports the same functionality. Answer 1: It looks like lxml doesn't provide a special API to create a text node. You can simply set the text property of a parent element to create or modify text
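A short sketch of what the answer describes, assuming Python 3: lxml has no separate text-node objects, so text is stored on elements through the text attribute (content before the first child) and the tail attribute (content after an element's closing tag). The element names here are illustrative.

```python
from lxml import etree

root = etree.Element("root")
root.text = "leading text "        # text inside <root>, before any child

child = etree.SubElement(root, "b")
child.text = "bold"                # text inside <b>
child.tail = " trailing text"      # text after </b>, still inside <root>

serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

Setting tail on the last child is how text is placed after an existing element without wrapping it in a new tag.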

Getting a memory error when parsing a large XML file in Python


Question: My XML file looks like this: <root> <group from="1" to="100"> <link target="1"/> ... <link target="100"/> </group> ... </root> I have 6000 <group> elements and 5M <link> elements. I want to have a dictionary with the tuple ( from , to ) as keys and a list of the <link> elements' target attributes as values, but I get a memory error with the following code: from lxml import etree from gzip import open as gopen def extractTargets(fin): targets = dict() with gopen(fin) as xml: context = etree.iterparse(xml, tag=
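The asker's code is truncated, so the following is not a reconstruction of it. It is a sketch of a commonly used pattern for keeping iterparse memory bounded, assuming Python 3: handle each <group> on its "end" event, then clear it and delete the already-processed siblings that the root still references. The function name extract_targets and the tiny in-memory sample are hypothetical; a gzip file object could be passed in the same way.

```python
import io

from lxml import etree


def extract_targets(source):
    """Map (from, to) of each <group> to the list of its <link> targets."""
    targets = {}
    for _event, group in etree.iterparse(source, events=("end",), tag="group"):
        key = (group.get("from"), group.get("to"))
        targets[key] = [link.get("target") for link in group.iterfind("link")]
        # Free the children of this group ...
        group.clear()
        # ... and drop earlier, already-processed siblings from the root.
        while group.getprevious() is not None:
            del group.getparent()[0]
    return targets


sample = io.BytesIO(
    b'<root><group from="1" to="2"><link target="1"/><link target="2"/></group>'
    b'<group from="3" to="3"><link target="3"/></group></root>'
)
result = extract_targets(sample)
print(result)
```

The resulting dictionary itself still grows with the input, so this only removes the parsed tree as a source of memory pressure.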