lxml | 易学教程

Can I get lxml to ignore non-XML content before and after the root tag?

Question: I'm trying to use lxml to process a file that may have some non-XML junk both before and after the XML content. Imagine someone captured a terminal buffer and I have something like this:

    user@host: cat /tmp/log.xml
    <log>
    <foo>...</foo>
    <bar>..
    ...
    </bar>
    </log>
    user@host:

If I hand etree.parse the filename, it chokes on the beginning content. I can delete the first set of lines until I find a line starting with '<' and hand that to etree.parse, but then it chokes on the closing content. The
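One way to approach the problem described above, as a minimal sketch rather than a definitive answer: slice the captured text from the first `<` to the last `>` and parse only that span. The helper name `extract_xml` is hypothetical, and the sketch assumes the surrounding junk contains no angle brackets of its own. It falls back to the standard library's `xml.etree.ElementTree`, which offers a compatible `fromstring`, when lxml is not installed.

```python
try:
    from lxml import etree  # preferred when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible API


def extract_xml(text):
    """Return the span from the first '<' to the last '>' (hypothetical helper).

    Assumption: the junk before and after the XML contains no angle brackets.
    """
    start = text.find("<")
    end = text.rfind(">")
    if start == -1 or end < start:
        raise ValueError("no XML-looking content found")
    return text[start:end + 1]


# Terminal capture modelled on the example in the question.
captured = """user@host: cat /tmp/log.xml
<log>
<foo>...</foo>
<bar>..
...
</bar>
</log>
user@host:"""

root = etree.fromstring(extract_xml(captured))
child_tags = [child.tag for child in root]
print(root.tag, child_tags)
```

If the junk itself may contain `<` or `>` (a shell prompt with redirection, for instance), this slicing is not enough and the search would need to anchor on the known root tag instead.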

Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

Question: I need to automatically scan lots of HTML documents for ad banners that are surrounded by an anchor tag, e.g.:

    <a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
    </a>

As a newbie with XPath, I can select such anchors via lxml like so:

    import lxml.html

    text = '''
    <a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
    </a>'''
    root = lxml.html.fromstring(text)
    print(root.xpath('//a[contains(@href,("ad_network.")) or contains(@href,("other_ad_network."))][descendant::img]'))

In the

lxml requests on repl.it

Question: I'm trying to use lxml and requests on Replit and I don't understand why it isn't working. The program keeps running until it hits the maximum number of retries, at which point I get this error:

    Traceback (most recent call last):
      File "python", line 6, in <module>
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.presidency.ucsb.edu', port=80): Max retries exceeded with url: /ws/index.php?pid=29400.html (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known',))

my

lxml tostring() returns blank string in Flask running on mod-wsgi

Question: I have a Python 2.7.6 Flask application that is trying to parse a SAML XML document using the lxml library. I'm running into an issue where etree.tostring(...) returns an empty string.

    etree_string = etree.tostring(etree.fromstring(b'<test1><test2></test2></test1>'))
    return etree_string  # output: ''

This appears to occur only when the code is run within the Flask app, served by mod_wsgi in Apache. I say this because in the same virtualenv, if I open a Python interpreter and run:

    >>> etree

Installing lxml with pip in virtualenv Ubuntu 12.10 error: command 'gcc' failed with exit status 4

Question: I get the following error when trying to run "pip install lxml" in a virtualenv on Ubuntu 12.10 x64. I have Python 2.7. I have seen other related questions here about the same problem and tried installing python-dev, libxml2-dev and libxslt1-dev. Please take a look at the traceback from the moment I type the command to the moment the error occurs.

    Downloading/unpacking lxml
      Running setup.py egg_info for package lxml
        /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown

wrap the contents of a tag with BeautifulSoup

Question: I'm trying to wrap the contents of a tag with BeautifulSoup. This:

    <div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
    </div>

should become this:

    <div class="footnotes">
    <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
    </ol>
    </div>

So I use the following code:

    footnotes = soup.findAll("div", { "class" : "footnotes" })
    footnotes_contents = ''
    new_ol = soup.new_tag("ol")
    for content in footnotes[0].children:
        new_tag = soup.new_tag(content)
        new_ol.append(new_tag)
    footnotes[0].clear()
    footnotes[0]

Installing lxml on Mac OS X 10.6.8 with gcc 4.2 [duplicate]

Question: This question already has answers here: How do you install lxml on OS X Leopard without using MacPorts or Fink? (15 answers) Closed 6 years ago.

I've installed gcc on Mac OS X 10.6.8 using the osx-gcc-installer. Downloading Xcode would take forever, but I managed to download and install this 170 MB package, and I am able to compile a Hello, world! program using iostream and std::cout. Then I tried to install lxml using Python's easy_install lxml. It couldn't find gcc-4.0. I added a

How to transform XML to text

Question: Following on from my earlier question (how to transform XML?), I now have a nicely structured XML doc, like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <employee id="1" reportsTo="1" title="CEO">
        <employee id="2" reportsTo="1" title="Director of Operations">
          <employee id="3" reportsTo="2" title="Human Resources Manager" />
        </employee>
      </employee>
    </root>

Now I need to convert it to JavaScript like this:

    var treeData = [
      {
        "name": "CEO",
        "parent": "null",
        "children": [
          {
            "name":
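The conversion asked about above could be sketched as a recursive walk over the nested employee elements. This is only an illustration under stated assumptions: the target format in the excerpt is cut off, so the sketch assumes each node carries its title as "name", its parent's title as "parent" (the string "null" at the top level, as in the excerpt), and a possibly empty "children" list. The function name `to_node` is hypothetical, and the standard library's `xml.etree.ElementTree` is used here in place of lxml.

```python
import json
import xml.etree.ElementTree as ET

# Sample document copied from the question (bytes, since it declares an encoding).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <employee id="1" reportsTo="1" title="CEO">
    <employee id="2" reportsTo="1" title="Director of Operations">
      <employee id="3" reportsTo="2" title="Human Resources Manager" />
    </employee>
  </employee>
</root>"""


def to_node(employee, parent_name):
    """Convert one <employee> element and its descendants into a nested dict.

    Assumption: "parent" holds the parent's title; the source excerpt is
    truncated before it shows what child nodes look like.
    """
    name = employee.get("title")
    return {
        "name": name,
        "parent": parent_name,
        "children": [to_node(child, name) for child in employee.findall("employee")],
    }


root = ET.fromstring(XML)
tree_data = [to_node(e, "null") for e in root.findall("employee")]

# Emit a JavaScript assignment in the shape the question starts to show.
print("var treeData = " + json.dumps(tree_data, indent=2) + ";")
```

An XSLT stylesheet with text output would be the other common route; the recursive version is shown because it keeps the nesting logic in a few lines of Python.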

Python parsing: lxml to get just part of a tag's text

Question: I'm working in Python with HTML that looks like this. I'm parsing with lxml, but could equally happily use pyquery:

    <p><span class="Title">Name</span>Dave Davies</p>
    <p><span class="Title">Address</span>123 Greyfriars Road, London</p>

Pulling out 'Name' and 'Address' is dead easy, whatever library I use, but how do I get the remainder of the text, i.e. 'Dave Davies'?

Answer 1: Each Element can have a text and a tail attribute (in the link, search for the word "tail"):

    import lxml.etree
    content=''
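Since the answer's own code is cut off in this excerpt, here is a minimal sketch of the `text`/`tail` idea it describes, not a reconstruction of the original answer. An element's `text` is the string before its first child, and its `tail` is the string that follows its closing tag. The wrapping `<div>` is an assumption added so that the two paragraphs form one well-formed document, and the `fields` dictionary is a hypothetical name. The stdlib `xml.etree.ElementTree` exposes the same two attributes and serves as a fallback when lxml is not installed.

```python
try:
    from lxml import etree  # preferred when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same text/tail attributes

# The two paragraphs from the question, wrapped in a <div> (an assumption)
# so the string parses as a single well-formed document.
content = (
    '<div>'
    '<p><span class="Title">Name</span>Dave Davies</p>'
    '<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'
    '</div>'
)

root = etree.fromstring(content)

fields = {}
for span in root.iter("span"):
    # span.text is the label inside the <span>;
    # span.tail is the text right after </span>, still inside the <p>.
    fields[span.text] = (span.tail or "").strip()

print(fields)
```

The `or ""` guard is there because `tail` is `None` when nothing follows the closing tag.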

Difficulty creating lxml Element subclass

Question: I’m trying to create a subclass of the Element class. I’m having trouble getting started though.

    from lxml import etree
    try:
        import docx
    except ImportError:
        from docx import docx

    class File(etree.ElementBase):
        def _init(self):
            etree.ElementBase._init(self)
            self.body = self.append(docx.makeelement('body'))

    f = File()
    relationships = docx.relationshiplist()
    title = 'File'
    subject = 'A very special File'
    creator = 'Me'
    keywords = ['python', 'Office Open XML', 'Word']
    coreprops = docx