lxml

How to get the path of an element in lxml?

南楼画角 submitted on 2019-11-27 11:41:40
I'm searching an HTML document using XPath from lxml in Python. How can I get the path to a certain element? Here's the example from Ruby's Nokogiri:

    page.xpath('//text()').each do |textnode|
      path = textnode.path
      puts path
    end

This prints, for example, '/html/body/div/div[1]/div[1]/p/text()[1]', and that is the string I want to get in Python.

Answer 1: Use getpath from ElementTree objects.

    from lxml import etree

    root = etree.fromstring('<foo><bar>Data</bar><bar><baz>data</baz>'
                            '<baz>data</baz></bar></foo>')
    tree = etree.ElementTree(root)
    for e in root.iter():
        print tree.getpath(e)

Prints:

    /foo
    /foo/bar[1]
    /foo
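The question asks about text nodes specifically. A minimal sketch (using a made-up HTML snippet, and relying on lxml's smart-string XPath results exposing getparent()) of combining getpath() with the text node's parent:

    from lxml import etree

    html = '<html><body><div><p>Some text</p></div></body></html>'
    root = etree.fromstring(html, parser=etree.HTMLParser())
    tree = etree.ElementTree(root)

    for text_node in root.xpath('//text()'):
        parent = text_node.getparent()           # element containing the text
        print(tree.getpath(parent) + '/text()')  # e.g. /html/body/div/p/text()

Note that this appends a plain '/text()' without the positional index that Nokogiri's .path includes.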

Encoding in Python with lxml - complex solution

∥☆過路亽.° submitted on 2019-11-27 11:21:44
Question: I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

    from lxml import etree
    webfile = urllib2.urlopen(url)
    root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))
    txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))
    output = etree.Element("out")
    output.text = txt
    outputfile.write(etree.tostring(output, encoding=utf8))

So webfile can be in any encoding (lxml should handle this).
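A minimal Python 3 sketch of how that flow could look (not the poster's code; the URL and output file name are placeholders and the my_process_text step is dropped), letting lxml detect the page encoding from the raw bytes and serializing the result as UTF-8:

    from urllib.request import urlopen
    from lxml import etree

    url = 'http://example.com/'                      # placeholder URL
    raw = urlopen(url).read()                        # raw bytes, unknown encoding
    root = etree.fromstring(raw, parser=etree.HTMLParser(recover=True))

    body = root.xpath('/html/body')[0]               # xpath() returns a list
    txt = etree.tostring(body, encoding='unicode')   # text, not bytes

    output = etree.Element('out')
    output.text = txt                                # markup is escaped on output
    with open('out.xml', 'wb') as outputfile:
        outputfile.write(etree.tostring(output, encoding='UTF-8', xml_declaration=True))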

Installing the lxml module in Python

帅比萌擦擦* submitted on 2019-11-27 11:10:35
While running a Python script, I got this error:

    from lxml import etree
    ImportError: No module named lxml

Now I tried to install lxml:

    sudo easy_install lxml

but it gives me the following error:

    Building lxml version 2.3.beta1.
    NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
    ERROR: /bin/sh: xslt-config: not found
    ** make sure the development packages of libxml2 and libxslt are installed **
    Using build configuration of libxslt
    src/lxml/lxml.etree.c:4: fatal error: Python.h: No such file or directory
    compilation terminated.
    error: Setup script
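Both errors point at missing development headers: xslt-config comes from the libxslt dev package and Python.h from the Python dev package. A sketch of the packages that usually fix this on a Debian/Ubuntu system (the package names are an assumption about the poster's distribution):

    sudo apt-get install libxml2-dev libxslt1-dev python-dev
    sudo easy_install lxml    # or: pip install lxml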

src/lxml/etree_defs.h:9:31: fatal error: libxml/xmlversion.h: No such file or directory

核能气质少年 submitted on 2019-11-27 10:38:21
I am running the following command to install the packages in that file:

    pip install -r requirements.txt --download-cache=~/tmp/pip-cache

requirements.txt contains packages like:

    # Data formats
    # ------------
    PIL==1.1.7
    # html5lib==0.90
    httplib2==0.7.4
    lxml==2.3.1

    # Documentation
    # -------------
    Sphinx==1.1
    docutils==0.8.1

    # Testing
    # -------
    behave==1.1.0
    dingus==0.3.2
    django-testscenarios==0.7.2
    mechanize==0.2.5
    mock==0.7.2
    testscenarios==0.2
    testtools==0.9.14
    wsgi_intercept==0.5.1

When it comes to installing the "lxml" package I get the following error: Requirement already satisfied
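The fatal error in the title means the libxml2 headers are missing when pip builds lxml 2.3.1 from source. A sketch of two common fixes (the apt package names, and the STATIC_DEPS build switch documented by lxml, are assumptions about the environment):

    # install the headers, then rerun pip
    sudo apt-get install libxml2-dev libxslt1-dev
    pip install -r requirements.txt --download-cache=~/tmp/pip-cache

    # or let lxml build its own libxml2/libxslt statically
    STATIC_DEPS=true pip install lxml==2.3.1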

python - lxml: enforcing a specific order for attributes

半城伤御伤魂 submitted on 2019-11-27 09:15:34
I have an XML-writing script that outputs XML for a specific 3rd-party tool. I've used the original XML as a template to make sure that I'm building all the correct elements, but the final XML does not look like the original. I write the attributes in the same order, but lxml is writing them in its own order. I'm not sure, but I suspect that the 3rd-party tool expects attributes to appear in a specific order, and I'd like to resolve this issue so I can see whether it's the attribute order that is making it fail, or something else. Source element:

    <FileFormat ID="1" Name="Development Signature" PUID="dev
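A minimal sketch (assuming lxml; the attribute values stand in for the truncated ones above) showing that lxml serializes attributes in the order in which they are set, so setting them in the desired sequence controls the output order:

    from lxml import etree

    ff = etree.Element('FileFormat')
    for name, value in [('ID', '1'), ('Name', 'Development Signature'), ('PUID', 'dev/1')]:
        ff.set(name, value)   # insertion order is preserved on serialization

    print(etree.tostring(ff))
    # b'<FileFormat ID="1" Name="Development Signature" PUID="dev/1"/>'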

Equivalent to InnerHTML when using lxml.html to parse HTML

故事扮演 submitted on 2019-11-27 08:51:37
I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml because of its speed. I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. Given:

    <body>
      <h1>A title</h1>
      <p>Some text</p>
    </body>

the innerHTML is therefore:

    <h1>A title</h1>
    <p>Some text</p>

I can do it using hacks (converting to string/regexes etc.), but I'm assuming that there is a correct way to do this using the library which I am missing
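A minimal sketch (not from the question) of one common way to emulate reading innerHTML with lxml: serialize the element's leading text plus each child element (tostring() includes each child's tail text by default):

    from lxml import html

    doc = html.fromstring('<html><body><h1>A title</h1><p>Some text</p></body></html>')
    body = doc.find('body')

    def inner_html(el):
        # text before the first child, then every child with its trailing text
        return (el.text or '') + ''.join(
            html.tostring(child, encoding='unicode') for child in el
        )

    print(inner_html(body))
    # <h1>A title</h1><p>Some text</p>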

SSL: CERTIFICATE_VERIFY_FAILED certificate verify failed

匆匆过客 submitted on 2019-11-27 08:34:14
Question:

    from lxml import html
    import requests

    url = "https://website.com/"
    page = requests.get(url)
    tree = html.fromstring(page.content)

    page.content -> SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)

I run this script but I get this error. How can I fix it?

Answer 1: Since your URL is "an internal corporate URL" (as stated in the comments), I'm guessing it uses a self-signed certificate, or is issued by a self-signed CA certificate. If that is in fact the case, you have two
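A hedged sketch of the two usual options with requests (not the original answer's code; the certificate path is a placeholder):

    import requests
    from lxml import html

    url = "https://website.com/"

    # Option 1 (preferred): trust the internal CA explicitly.
    page = requests.get(url, verify="/path/to/internal-ca-bundle.pem")

    # Option 2 (insecure, last resort): skip certificate verification.
    # page = requests.get(url, verify=False)

    tree = html.fromstring(page.content)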

Extracting an lxml XPath for an HTML table

纵然是瞬间 submitted on 2019-11-27 08:23:56
I have an HTML doc similar to the following:

    <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
      <div id="Symbols" class="cb">
        <table class="quotes">
          <tr><th>Code</th><th>Name</th>
            <th style="text-align:right;">High</th>
            <th style="text-align:right;">Low</th>
          </tr>
          <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
            <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
            <td>A Inc.</td>
            <td align="right">45.44</td>
            <td align="right">44.26</td>
          <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
            <td><a href="/xyz.com/B
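A minimal sketch (assuming lxml and a file shaped like the snippet above; the file name is a placeholder) of pulling the data rows out with XPath. lxml's HTML parser ignores the XHTML namespace declaration, so plain element names work:

    from lxml import html

    doc = html.parse('quotes.html')
    rows = doc.xpath('//div[@id="Symbols"]//table[@class="quotes"]/tr[position() > 1]')

    for row in rows:
        cells = [cell.text_content().strip() for cell in row.xpath('./td')]
        print(cells)   # e.g. ['A', 'A Inc.', '45.44', '44.26']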

Is it possible to validate an XML file against XSD 1.1 in Python?

混江龙づ霸主 submitted on 2019-11-27 08:19:35
Question: I want to validate an XML file against an XSD file using lxml.XMLSchema, but the problem is that the XSD is in version 1.1, so it doesn't work. This is part of the XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <dictionary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:noNamespaceSchemaLocation="!!assert.xsd">
      <SizeType>10</SizeType>
    </dictionary>

And this is its XSD file:

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
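lxml is built on libxml2, which only implements XSD 1.0. A hedged sketch of one common workaround, the pure-Python xmlschema package and its XMLSchema11 class (the file names are placeholders):

    import xmlschema

    schema = xmlschema.XMLSchema11('assert.xsd')
    print(schema.is_valid('dictionary.xml'))   # True / False
    # schema.validate('dictionary.xml')        # raises on the first violation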

Why is lxml.etree.iterparse() eating up all my memory?

左心房为你撑大大i submitted on 2019-11-27 07:39:16
This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong / how can I process this large file with iterparse()?

    import lxml.etree

    for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

Answer 1: As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is
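A hedged sketch of the usual fix (not necessarily the answer's exact code): clear each element once it has been handled and delete already-processed siblings so the partially built tree stops growing; the file and tag names come from the question:

    import lxml.etree

    for event, element in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        # ... process `element` here ...
        element.clear()
        # drop references the root keeps to preceding, already-processed siblings
        while element.getprevious() is not None:
            del element.getparent()[0]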