lxml

Can I get lxml to ignore non-XML content before and after the root tag?

我的梦境, posted on 2019-12-12 02:55:43
Question: I'm trying to use lxml to process a file that may have some non-XML junk both before and after the XML content. Imagine someone captured a terminal buffer and I have something like this:

    user@host: cat /tmp/log.xml
    <log>
    <foo>...</foo>
    <bar>..
    ...
    </bar>
    </log>
    user@host:

If I hand etree.parse the filename, it chokes on the beginning content. I can delete the first set of lines until I find a line starting with '<' and hand that to etree.parse, but then it chokes on the closing content. The
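
One possible approach to the question above, as a minimal sketch rather than a definitive answer: slice the captured text from the first '<' to the last '>' and parse only that slice. The helper name parse_embedded_xml and the element contents "a" and "b" are hypothetical, and the sketch assumes the surrounding junk contains no angle brackets of its own (a shell prompt ending in '>' would break it). The import falls back to the standard library so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def parse_embedded_xml(text):
    """Parse the XML found between the first '<' and the last '>' in text.

    Hypothetical helper: assumes the junk around the document contains
    no '<' or '>' characters.
    """
    start = text.find("<")
    end = text.rfind(">")
    if start == -1 or end < start:
        raise ValueError("no XML-like content found")
    return etree.fromstring(text[start:end + 1])


# Sample capture modeled on the question; element contents are made up.
captured = """user@host: cat /tmp/log.xml
<log>
  <foo>a</foo>
  <bar>b</bar>
</log>
user@host:"""

root = parse_embedded_xml(captured)
print(root.tag, [child.tag for child in root])
```

Slicing on the outermost angle brackets keeps the sketch short. A stricter variant could search for the known root tag name instead, if it is known in advance.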

Select all anchor tags with an href attribute that contains one of multiple values via xpath in lxml / Python

别说谁变了你拦得住时间么, posted on 2019-12-12 02:48:55
Question: I need to automatically scan lots of HTML documents for ad banners that are surrounded by an anchor tag, e.g.:

    <a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
    </a>

As a newbie with XPath, I can select such anchors via lxml like so:

    import lxml.html

    text = '''
    <a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
    </a>'''
    root = lxml.html.fromstring(text)
    print(root.xpath('//a[contains(@href,("ad_network.")) or contains(@href,("other_ad_network."))][descendant::img]'))

In the

lxml requests on repl.it

大兔子大兔子, posted on 2019-12-12 01:57:11
Question: I'm trying to use lxml and requests on Replit, and I don't understand why it isn't working. The program keeps running until it reaches the maximum number of retries, and then I get this error:

    Traceback (most recent call last):
      File "python", line 6, in <module>
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.presidency.ucsb.edu', port=80): Max retries exceeded with url: /ws/index.php?pid=29400.html (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known',))

my

lxml tostring() returns blank string in Flask running on mod-wsgi

可紊, posted on 2019-12-12 01:43:50
Question: I have a Python 2.7.6 Flask application that is trying to parse a SAML XML document using the lxml library. I'm running into an issue where etree.tostring(...) returns an empty string.

    etree_string = etree.tostring(etree.fromstring(b'<test1><test2></test2></test1>'))
    return etree_string  # output: ''

This appears to occur only when the code is run within the Flask app, served by mod_wsgi in Apache. I say this because in the same virtualenv, if I open a Python interpreter and run:

    >>> etree

Installing lxml with pip in virtualenv Ubuntu 12.10 error: command 'gcc' failed with exit status 4

我怕爱的太早我们不能终老, posted on 2019-12-12 01:20:16
Question: I get the following error when trying to run "pip install lxml" in a virtualenv on Ubuntu 12.10 x64. I have Python 2.7. I have seen other related questions here about the same problem and tried installing python-dev, libxml2-dev and libxslt1-dev. Please take a look at the traceback, from the moment I type the command to the moment the error occurs.

    Downloading/unpacking lxml
      Running setup.py egg_info for package lxml
        /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown

wrap the contents of a tag with BeautifulSoup

旧城冷巷雨未停, posted on 2019-12-11 23:54:03
Question: I'm trying to wrap the contents of a tag with BeautifulSoup. This:

    <div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
    </div>

should become this:

    <div class="footnotes">
    <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
    </ol>
    </div>

So I use the following code:

    footnotes = soup.findAll("div", { "class" : "footnotes" })
    footnotes_contents = ''
    new_ol = soup.new_tag("ol")
    for content in footnotes[0].children:
        new_tag = soup.new_tag(content)
        new_ol.append(new_tag)
    footnotes[0].clear()
    footnotes[0]

Installing lxml on Mac OS X 10.6.8 with gcc 4.2 [duplicate]

不羁的心, posted on 2019-12-11 23:26:43
Question: This question already has answers here: How do you install lxml on OS X Leopard without using MacPorts or Fink? (15 answers). Closed 6 years ago.

I've installed gcc on Mac OS X 10.6.8 using the osx-gcc-installer. Downloading Xcode would take forever, but I managed to download and install this 170 MB package, and I am able to compile a "Hello, world!" program using iostream and std::cout. Then I tried to install lxml using Python's easy_install lxml. It couldn't find gcc-4.0. I added a

How to transform XML to text

心已入冬, posted on 2019-12-11 19:33:40
Question: Following on from my earlier question (how to transform XML?), I now have a nicely structured XML doc, like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <employee id="1" reportsTo="1" title="CEO">
        <employee id="2" reportsTo="1" title="Director of Operations">
          <employee id="3" reportsTo="2" title="Human Resources Manager" />
        </employee>
      </employee>
    </root>

Now I need to convert it to JavaScript like this:

    var treeData = [ { "name": "CEO", "parent": "null", "children": [ { "name":
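
The conversion described above could be sketched as follows. This is not the approach from the original thread (which is cut off here, and the earlier question suggests a stylesheet transformation): it swaps in a plain recursive walk in Python. The function name to_node is hypothetical. Because the target output is truncated, the sketch assumes that each child's "parent" is the title of its enclosing employee and that leaf nodes carry no "children" key; only the top-level "parent": "null" is taken from the excerpt.

```python
import json

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Bytes input, because lxml rejects str input that carries an
# encoding declaration.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <employee id="1" reportsTo="1" title="CEO">
    <employee id="2" reportsTo="1" title="Director of Operations">
      <employee id="3" reportsTo="2" title="Human Resources Manager" />
    </employee>
  </employee>
</root>"""


def to_node(employee, parent_title="null"):
    """Turn one <employee> element into a nested dict (hypothetical helper)."""
    title = employee.get("title")
    node = {"name": title, "parent": parent_title}
    # Assumption: a child's "parent" is the title of the enclosing employee.
    children = [to_node(child, title) for child in employee.findall("employee")]
    if children:
        node["children"] = children
    return node


root = etree.fromstring(XML)
tree_data = [to_node(e) for e in root.findall("employee")]
print("var treeData = " + json.dumps(tree_data, indent=2) + ";")
```

Building dicts and serializing them with json.dumps leaves quoting and escaping to the JSON encoder instead of assembling the JavaScript text by hand.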

Python parsing: lxml to get just part of a tag's text

两盒软妹~`, posted on 2019-12-11 18:34:15
Question: I'm working in Python with HTML that looks like this. I'm parsing with lxml, but could equally happily use pyquery:

    <p><span class="Title">Name</span>Dave Davies</p>
    <p><span class="Title">Address</span>123 Greyfriars Road, London</p>

Pulling out 'Name' and 'Address' is dead easy, whatever library I use, but how do I get the remainder of the text, i.e. 'Dave Davies'?

Answer 1: Each Element can have a text and a tail attribute (in the link, search for the word "tail"):

    import lxml.etree
    content=''
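
The text/tail distinction mentioned in the answer can be illustrated with a short sketch. It is not the truncated code from the answer itself. The wrapping div is an assumption added so that the two paragraphs form a single well-formed document, and the fields dict is a hypothetical way to collect the results. An element's text is the text before its first child; its tail is the text that follows its closing tag.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# The <div> wrapper is an assumption: it gives the fragment a single root.
content = """<div>
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>
</div>"""

root = etree.fromstring(content)

fields = {}
for span in root.iter("span"):
    # span.text is the label inside the span; span.tail is the text
    # that comes right after </span>, up to the next tag.
    fields[span.text] = (span.tail or "").strip()

print(fields)
```

The `or ""` guard matters because tail is None when nothing follows the closing tag.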

Difficulty creating lxml Element subclass

不羁岁月, posted on 2019-12-11 18:02:15
Question: I’m trying to create a subclass of the Element class. I’m having trouble getting started though.

    from lxml import etree
    try:
        import docx
    except ImportError:
        from docx import docx

    class File(etree.ElementBase):
        def _init(self):
            etree.ElementBase._init(self)
            self.body = self.append(docx.makeelement('body'))

    f = File()
    relationships = docx.relationshiplist()
    title = 'File'
    subject = 'A very special File'
    creator = 'Me'
    keywords = ['python', 'Office Open XML', 'Word']
    coreprops = docx