lxml

How to get the path of an element in lxml?

南楼画角 submitted on 2019-11-27 11:41:40
I'm searching an HTML document using XPath from lxml in Python. How can I get the path to a certain element? Here's the example from Ruby's Nokogiri:

    page.xpath('//text()').each do |textnode|
      path = textnode.path
      puts path
    end

This prints, for example, '/html/body/div/div[1]/div[1]/p/text()[1]', and that is the string I want to get in Python.

Answer 1: Use getpath from ElementTree objects.

    from lxml import etree

    root = etree.fromstring('<foo><bar>Data</bar><bar><baz>data</baz>'
                            '<baz>data</baz></bar></foo>')
    tree = etree.ElementTree(root)
    for e in root.iter():
        print tree.getpath(e)

Prints:

    /foo
    /foo/bar[1]
    /foo
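The question asks about text nodes specifically. A minimal sketch (using a made-up HTML snippet, and relying on lxml's smart-string XPath results exposing getparent()) of combining getpath() with the text node's parent:

    from lxml import etree

    html = '<html><body><div><p>Some text</p></div></body></html>'
    root = etree.fromstring(html, parser=etree.HTMLParser())
    tree = etree.ElementTree(root)

    for text_node in root.xpath('//text()'):
        parent = text_node.getparent()           # element containing the text
        print(tree.getpath(parent) + '/text()')  # e.g. /html/body/div/p/text()

Note that this appends a plain '/text()' without the positional index that Nokogiri's .path includes.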

Encoding in Python with lxml - complex solution

∥☆過路亽.° submitted on 2019-11-27 11:21:44
Question: I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

    from lxml import etree
    webfile = urllib2.urlopen(url)
    root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))
    txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))
    output = etree.Element("out")
    output.text = txt
    outputfile.write(etree.tostring(output, encoding=utf8))

So webfile can be in any encoding (lxml should handle this).
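A minimal Python 3 sketch of how that flow could look (not the poster's code; the URL and output file name are placeholders and the my_process_text step is dropped), letting lxml detect the page encoding from the raw bytes and serializing the result as UTF-8:

    from urllib.request import urlopen
    from lxml import etree

    url = 'http://example.com/'                      # placeholder URL
    raw = urlopen(url).read()                        # raw bytes, unknown encoding
    root = etree.fromstring(raw, parser=etree.HTMLParser(recover=True))

    body = root.xpath('/html/body')[0]               # xpath() returns a list
    txt = etree.tostring(body, encoding='unicode')   # text, not bytes

    output = etree.Element('out')
    output.text = txt                                # markup is escaped on output
    with open('out.xml', 'wb') as outputfile:
        outputfile.write(etree.tostring(output, encoding='UTF-8', xml_declaration=True))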

Installing the lxml module in Python

帅比萌擦擦* submitted on 2019-11-27 11:10:35
While running a Python script, I got this error:

    from lxml import etree
    ImportError: No module named lxml

Now I tried to install lxml:

    sudo easy_install lxml

but it gives me the following error:

    Building lxml version 2.3.beta1.
    NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
    ERROR: /bin/sh: xslt-config: not found
    ** make sure the development packages of libxml2 and libxslt are installed **
    Using build configuration of libxslt
    src/lxml/lxml.etree.c:4: fatal error: Python.h: No such file or directory
    compilation terminated.
    error: Setup script
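Both errors point at missing development headers: xslt-config comes from the libxslt dev package and Python.h from the Python dev package. A sketch of the packages that usually fix this on a Debian/Ubuntu system (the package names are an assumption about the poster's distribution):

    sudo apt-get install libxml2-dev libxslt1-dev python-dev
    sudo easy_install lxml    # or: pip install lxml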

src/lxml/etree_defs.h:9:31: fatal error: libxml/xmlversion.h: No such file or directory

核能气质少年 submitted on 2019-11-27 10:38:21
I am running the following command to install the packages in that file:

    pip install -r requirements.txt --download-cache=~/tmp/pip-cache

requirements.txt contains packages like:

    # Data formats
    # ------------
    PIL==1.1.7
    # html5lib==0.90
    httplib2==0.7.4
    lxml==2.3.1

    # Documentation
    # -------------
    Sphinx==1.1
    docutils==0.8.1

    # Testing
    # -------
    behave==1.1.0
    dingus==0.3.2
    django-testscenarios==0.7.2
    mechanize==0.2.5
    mock==0.7.2
    testscenarios==0.2
    testtools==0.9.14
    wsgi_intercept==0.5.1

When it comes to installing the "lxml" package I get the following error: Requirement already satisfied
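The fatal error in the title means the libxml2 headers are missing when pip builds lxml 2.3.1 from source. A sketch of two common fixes (the apt package names, and the STATIC_DEPS build switch documented by lxml, are assumptions about the environment):

    # install the headers, then rerun pip
    sudo apt-get install libxml2-dev libxslt1-dev
    pip install -r requirements.txt --download-cache=~/tmp/pip-cache

    # or let lxml build its own libxml2/libxslt statically
    STATIC_DEPS=true pip install lxml==2.3.1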

python - lxml: enforcing a specific order for attributes

半城伤御伤魂 submitted on 2019-11-27 09:15:34
I have an XML-writing script that outputs XML for a specific 3rd-party tool. I've used the original XML as a template to make sure that I'm building all the correct elements, but the final XML does not look like the original. I write the attributes in the same order, but lxml is writing them in its own order. I'm not sure, but I suspect that the 3rd-party tool expects attributes to appear in a specific order, and I'd like to resolve this issue so I can see whether it's the attribute order that is making it fail, or something else. Source element:

    <FileFormat ID="1" Name="Development Signature" PUID="dev
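A minimal sketch (assuming lxml; the attribute values stand in for the truncated ones above) showing that lxml serializes attributes in the order in which they are set, so setting them in the desired sequence controls the output order:

    from lxml import etree

    ff = etree.Element('FileFormat')
    for name, value in [('ID', '1'), ('Name', 'Development Signature'), ('PUID', 'dev/1')]:
        ff.set(name, value)   # insertion order is preserved on serialization

    print(etree.tostring(ff))
    # b'<FileFormat ID="1" Name="Development Signature" PUID="dev/1"/>'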

Equivalent to InnerHTML when using lxml.html to parse HTML

故事扮演 submitted on 2019-11-27 08:51:37
I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml because of its speed. I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. Given:

    <body>
      <h1>A title</h1>
      <p>Some text</p>
    </body>

the innerHTML is therefore:

    <h1>A title</h1>
    <p>Some text</p>

I can do it using hacks (converting to string/regexes etc.), but I'm assuming that there is a correct way to do this using the library which I am missing
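A minimal sketch (not from the question) of one common way to emulate reading innerHTML with lxml: serialize the element's leading text plus each child element (tostring() includes each child's tail text by default):

    from lxml import html

    doc = html.fromstring('<html><body><h1>A title</h1><p>Some text</p></body></html>')
    body = doc.find('body')

    def inner_html(el):
        # text before the first child, then every child with its trailing text
        return (el.text or '') + ''.join(
            html.tostring(child, encoding='unicode') for child in el
        )

    print(inner_html(body))
    # <h1>A title</h1><p>Some text</p>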

SSL: CERTIFICATE_VERIFY_FAILED certificate verify failed

匆匆过客 submitted on 2019-11-27 08:34:14
Question:

    from lxml import html
    import requests

    url = "https://website.com/"
    page = requests.get(url)
    tree = html.fromstring(page.content)

    page.content -> SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)

I run this script but I get this error. How can I fix it?

Answer 1: Since your URL is "an internal corporate URL" (as stated in the comments), I'm guessing it uses a self-signed certificate, or is issued by a self-signed CA certificate. If that is in fact the case, you have two
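A hedged sketch of the two usual options with requests (not the original answer's code; the certificate path is a placeholder):

    import requests
    from lxml import html

    url = "https://website.com/"

    # Option 1 (preferred): trust the internal CA explicitly.
    page = requests.get(url, verify="/path/to/internal-ca-bundle.pem")

    # Option 2 (insecure, last resort): skip certificate verification.
    # page = requests.get(url, verify=False)

    tree = html.fromstring(page.content)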

Extracting an lxml XPath for an HTML table

纵然是瞬间 submitted on 2019-11-27 08:23:56
I have an HTML doc similar to the following:

    <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
      <div id="Symbols" class="cb">
        <table class="quotes">
          <tr><th>Code</th><th>Name</th>
            <th style="text-align:right;">High</th>
            <th style="text-align:right;">Low</th>
          </tr>
          <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
            <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
            <td>A Inc.</td>
            <td align="right">45.44</td>
            <td align="right">44.26</td>
          <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
            <td><a href="/xyz.com/B
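A minimal sketch (assuming lxml and a file shaped like the snippet above; the file name is a placeholder) of pulling the data rows out with XPath. lxml's HTML parser ignores the XHTML namespace declaration, so plain element names work:

    from lxml import html

    doc = html.parse('quotes.html')
    rows = doc.xpath('//div[@id="Symbols"]//table[@class="quotes"]/tr[position() > 1]')

    for row in rows:
        cells = [cell.text_content().strip() for cell in row.xpath('./td')]
        print(cells)   # e.g. ['A', 'A Inc.', '45.44', '44.26']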

Is it possible to validate an XML file against XSD 1.1 in Python?

混江龙づ霸主 submitted on 2019-11-27 08:19:35
Question: I want to validate an XML file against an XSD file using lxml.XMLSchema, but the problem is that the XSD is in version 1.1, so it doesn't work. This is part of the XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <dictionary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:noNamespaceSchemaLocation="!!assert.xsd">
      <SizeType>10</SizeType>
    </dictionary>

And this is its XSD file:

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
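lxml is built on libxml2, which only implements XSD 1.0. A hedged sketch of one common workaround, the pure-Python xmlschema package and its XMLSchema11 class (the file names are placeholders):

    import xmlschema

    schema = xmlschema.XMLSchema11('assert.xsd')
    print(schema.is_valid('dictionary.xml'))   # True / False
    # schema.validate('dictionary.xml')        # raises on the first violation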

Why is lxml.etree.iterparse() eating up all my memory?

左心房为你撑大大i submitted on 2019-11-27 07:39:16
This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong / how can I process this large file with iterparse()?

    import lxml.etree

    for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

Answer 1: As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is
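A hedged sketch of the usual fix (not necessarily the answer's exact code): clear each element once it has been handled and delete already-processed siblings so the partially built tree stops growing; the file and tag names come from the question:

    import lxml.etree

    for event, element in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        # ... process `element` here ...
        element.clear()
        # drop references the root keeps to preceding, already-processed siblings
        while element.getprevious() is not None:
            del element.getparent()[0]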