lxml

Running Scrapy on PyPy

Submitted by 淺唱寂寞╮ on 2019-12-03 20:39:54
Is it possible to run Scrapy on PyPy? I've looked through the documentation and the GitHub project, but the only place PyPy is mentioned is that some unit tests were executed on PyPy two years ago; see PyPy support. There is also a long Scrapy fails in PyPy discussion from three years ago, without a concrete resolution or follow-up. From what I understand, Scrapy's main dependency, Twisted, is known to work on PyPy. Scrapy also uses lxml for HTML parsing, which has a PyPy-friendly fork. The other dependency, pyOpenSSL, is fully supported (thanks to @Glyph's comment). Answer: Yes.

lxml install on windows 7 using pip and python 2.7

Submitted by 时光怂恿深爱的人放手 on 2019-12-03 18:26:45
Question: When I try to upgrade lxml using pip on my Windows 7 machine I get the log printed below. When I uninstall and try to install from scratch I get the same errors. Any ideas? Downloading/unpacking lxml from https://pypi.python.org/packages/source/l/lxml/lxml-3.2.4.tar.gz#md5=cc363499060f615aca1ec8dcc04df331 Downloading lxml-3.2.4.tar.gz (3.3MB): 3.3MB downloaded Running setup.py egg_info for package lxml Building lxml version 3.2.4. Building without Cython. ERROR: Nazwa 'xslt-config' nie jest (Polish: "The name 'xslt-config' is not"; the log is cut off here)

lxml.etree, element.text doesn't return the entire text from an element

Submitted by ℡╲_俬逩灬 on 2019-12-03 16:18:42
Question: I scraped some HTML via XPath, which I then converted into an etree. Something similar to this: <td> text1 <a> link </a> text2 </td> but when I call element.text, I only get text1. (It must be there; when I check my query in FireBug, the text of the elements is highlighted, both the text before and after the embedded anchor elements... Answer 1: Use element.xpath("string()") or lxml.etree.tostring(element, method="text"); see the documentation. Answer 2: As a public service to people out there who may
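A runnable sketch of the behavior described and of the two fixes from Answer 1, built on the question's own `<td>` snippet:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# element.text only holds the text *before* the first child element,
# which is why the asker sees "text1" but not "text2".
print(td.text)

# Both fixes gather every text node in document order.
print(td.xpath("string()"))
print(etree.tostring(td, method="text", encoding="unicode"))
```

The text after the `<a>` element lives in `a.tail`, not in `td.text`, which is why iterating text nodes (or the two calls above) is required.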

lxml not adding newlines when inserting a new element into existing xml

Submitted by 痞子三分冷 on 2019-12-03 16:14:16
Question: I have a large set of existing XML files (they are pom.xml files for a number of Maven projects), and I am trying to add one element to all of them: a parent element. The following is my exact code. The problem is that the final XML output in pom2.xml has the complete parent element on a single line, though when I print the element by itself, it is written out on four lines as usual. How do I print out the complete XML with proper formatting for the parent element
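One common fix, sketched below as an assumption rather than the asker's exact code: parse with `remove_blank_text=True` so lxml is free to re-indent the tree, then serialize with `pretty_print=True`. The pom fragment and element names are illustrative:

```python
from lxml import etree

# Stripping the "blank" whitespace-only text nodes lets pretty_print
# re-indent everything, including newly inserted elements.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(
    b"<project><modelVersion>4.0.0</modelVersion></project>", parser)

parent = etree.SubElement(root, "parent")
etree.SubElement(parent, "groupId").text = "com.example"
etree.SubElement(parent, "artifactId").text = "example-parent"
etree.SubElement(parent, "version").text = "1.0"

print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```

Without `remove_blank_text=True`, the original file's whitespace text nodes are preserved verbatim and `pretty_print` leaves the inserted element on one line, which matches the symptom in the question.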

How to debug lxml.etree.XSLTParseError: Invalid expression error

Submitted by 两盒软妹~` on 2019-12-03 16:03:14
I'm trying to find out why lxml cannot parse an XSL document which consists of a "root" document with various xml:include elements. I get an error: Traceback (most recent call last): File "s.py", line 10, in <module> xslt = ET.XSLT(ET.parse(d)) File "xslt.pxi", line 409, in lxml.etree.XSLT.__init__ (src/lxml/lxml.etree.c:151978) lxml.etree.XSLTParseError: Invalid expression That tells me where in the lxml source the error is raised, but is there a way to get more detail through lxml about where in the XSL the error is, or should I be using a different method? I'm trying to provide a service that accepts XSL

Cannot install lxml on windows, fatal error C1083: Cannot open include file: 'libxml/xmlversion.h'

Submitted by 只愿长相守 on 2019-12-03 15:44:29
Python noob, please bear with me. I used the Python installer for v3.5.1 from www.python.org. My intent was to use Scrapy to run some scripts. pip install scrapy failed, as did easy_install scrapy and others. I traced the error to a faulty install of lxml. Here is the error log. I've even tried easy_installing libxml2, and I'm not sure how to proceed. Building lxml version 3.5.0. Building without Cython. ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n" ** make sure the development packages of libxml2 and libxslt are installed ** Using

Valid XPath expression

Submitted by 只谈情不闲聊 on 2019-12-03 15:33:19
Just two questions: How can I check if the string assigned to a variable corresponds to a valid XPath expression? How can I return a customized error message in case the requested resource does not exist? If the XPath is invalid, you'll get an exception. If the requested node does not exist, you'll get an empty result set. For example: from lxml import etree from StringIO import StringIO tree = etree.parse(StringIO('<foo><bar></bar></foo>')) try: tree.xpath('\BAD XPATH') print '1. Valid XPath' except etree.XPathEvalError, e: print '1. Invalid XPath: ', e if not tree.xpath('/foo/xxx'): print '2
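The answer's snippet is Python 2 and is cut off; a Python 3 rendering of the same idea follows (the behavior, not the exact code, is what the answer asserts):

```python
from io import StringIO
from lxml import etree

tree = etree.parse(StringIO("<foo><bar></bar></foo>"))

# 1. An invalid expression raises XPathEvalError at evaluation time.
try:
    tree.xpath(r"\BAD XPATH")
    print("1. Valid XPath")
except etree.XPathEvalError as e:
    print("1. Invalid XPath:", e)

# 2. A valid path that matches nothing returns an empty list,
#    which is the hook for a customized "not found" message.
if not tree.xpath("/foo/xxx"):
    print("2. Requested resource /foo/xxx does not exist")
```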

Python lxml iterfind w/ namespace but prefix=None

Submitted by 风格不统一 on 2019-12-03 14:41:22
I want to perform iterfind() for elements which have a namespace but no prefix. I'd like to call iterfind([tagname]) or iterfind([tagname], [namespace dict]). I don't want to have to enter the tag as follows every time: "{%s}tagname" % tree.nsmap[None] Details: I'm running through an XML response from a Google API. The root node defines several namespaces, including one for which there is no prefix: xmlns="http://www.w3.org/2005/Atom" When I try to search through my etree, everything behaves as I would expect for elements with a prefix, e.g.: >>> for x in root.iterfind('dxp:segment'):
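A sketch of the usual workaround: copy `nsmap` and give the default (None-prefixed) namespace a prefix of your own choosing, since ElementPath's namespace dict does not accept a None key. The Atom feed below is a stand-in for the Google API response in the question:

```python
from lxml import etree

xml = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dxp="http://schemas.google.com/analytics/2009">
  <entry><title>first</title></entry>
  <entry><title>second</title></entry>
</feed>"""
root = etree.fromstring(xml)

# Remap the prefixless namespace to an arbitrary prefix ("atom" here).
ns = dict(root.nsmap)
ns["atom"] = ns.pop(None)

titles = [e.findtext("atom:title", namespaces=ns)
          for e in root.iterfind("atom:entry", ns)]
print(titles)
```

The prefix name is local to the search call, so it does not have to match anything declared in the document itself.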

Stripping python namespace attributes from an lxml.objectify.ObjectifiedElement [duplicate]

Submitted by 人走茶凉 on 2019-12-03 14:29:13
Possible Duplicate: When using lxml, can the XML be rendered without namespace attributes? How can I strip the python attributes from an lxml.objectify.ObjectifiedElement? Example: In [1]: from lxml import etree, objectify In [2]: foo = objectify.Element("foo") In [3]: foo.bar = "hi" In [4]: foo.baz = 1 In [5]: foo.fritz = None In [6]: print etree.tostring(foo, pretty_print=True) <foo xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="TREE"> <bar py:pytype="str">hi</bar> <baz py
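The stock fix for this is `objectify.deannotate()` with `cleanup_namespaces=True`, which strips the `py:pytype`/`xsi` annotations and then drops the now-unused xmlns declarations. A minimal sketch, omitting the `fritz`/`xsi:nil` element from the transcript for brevity:

```python
from lxml import etree, objectify

foo = objectify.Element("foo")
foo.bar = "hi"
foo.baz = 1

# Remove the pytype/xsi annotations and the namespace declarations
# that carried them.
objectify.deannotate(foo, cleanup_namespaces=True)

print(etree.tostring(foo, pretty_print=True).decode())
```

After the call, the serialized output is plain `<foo><bar>hi</bar><baz>1</baz></foo>` with no namespace attributes.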

remove certain attributes from HTML tags

Submitted by 隐身守侯 on 2019-12-03 14:26:38
How can I remove certain attributes such as id, style, class, etc. from HTML code? I thought I could use the lxml.html.clean module, but as it turned out, I can only remove style attributes with Cleaner(style=True).clean_html(code). I'd prefer not to use regular expressions for this task (the attributes could change). What I would like to have: from lxml.html.clean import Cleaner code = '<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" class="Extended" style="display: none;">' cleaner = Cleaner(style=True, id=True, class=True) cleaned = cleaner.clean_html(code) print cleaned '<tr>'
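Note that the wished-for `Cleaner(..., class=True)` is a SyntaxError, since `class` is a Python keyword. One alternative sketch: pop the unwanted attributes directly off the parsed tree. A `<div>` stands in for the question's bare `<tr>`, because HTML parsers tend to mangle a `<tr>` outside a table:

```python
from lxml import html

code = ('<div id="ctl00_Content_AdManagementPreview_DetailView_divNova" '
        'class="Extended" style="display: none;"><p class="x">text</p></div>')
doc = html.fromstring(code)

# Drop the listed attributes from every element in the fragment;
# pop(..., None) is a no-op where the attribute is absent.
for el in doc.iter():
    for attr in ("id", "style", "class"):
        el.attrib.pop(attr, None)

print(html.tostring(doc).decode())
```

This sidesteps the Cleaner's fixed keyword set entirely, and the attribute tuple can be extended without touching anything else.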