lxml

Running Scrapy on PyPy

Submitted by 陌路散爱 on 2019-12-05 06:24:18
Question: Is it possible to run Scrapy on PyPy? I've looked through the documentation and the GitHub project, but the only place PyPy is mentioned is a note that some unit tests were run on PyPy two years ago; see PyPy support. There is also a long discussion, Scrapy fails in PyPy, from three years ago, without a concrete resolution or follow-up. From what I understand, Scrapy's main dependency, Twisted, is known to work on PyPy. Scrapy also uses lxml for HTML parsing, which has a PyPy…
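One practical way to probe the question is to try importing the stack under PyPy and see what loads. A minimal sketch, assuming Scrapy and its dependencies have already been installed into a PyPy environment; whether a full crawl then runs cleanly is a separate matter:

    import platform
    print(platform.python_implementation())   # "PyPy" when run under PyPy

    # Importing the key pieces is a quick sanity check that they at least load.
    import twisted
    import lxml.etree
    import scrapy

    print(twisted.version)
    print(lxml.etree.LXML_VERSION)
    print(scrapy.__version__)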

Python lxml - Append an existing xml with new data

Submitted by 孤街浪徒 on 2019-12-05 05:49:49
I am new to Python/lxml. After reading the lxml site and Dive Into Python I could not find the solution to my n00b troubles. I have the XML sample below:

    <addressbook>
      <person>
        <name>Eric Idle</name>
        <phone type='fix'>999-999-999</phone>
        <phone type='mobile'>555-555-555</phone>
        <address>
          <street>12, spam road</street>
          <city>London</city>
          <zip>H4B 1X3</zip>
        </address>
      </person>
    </addressbook>

I am trying to append one child to the root element and write the entire file back out as a new XML file, or overwrite the existing one. Currently all I am writing…
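For reference, a minimal sketch of the append-and-write-back pattern in lxml, assuming the snippet above is saved as addressbook.xml (the new person's details are made up purely for illustration):

    from lxml import etree

    tree = etree.parse("addressbook.xml")
    root = tree.getroot()

    # Build a new <person> element and attach it to the root.
    person = etree.SubElement(root, "person")
    etree.SubElement(person, "name").text = "John Cleese"
    etree.SubElement(person, "phone", type="mobile").text = "111-111-111"

    # Overwrite the existing file (or pass a different path for a new file).
    tree.write("addressbook.xml", pretty_print=True,
               xml_declaration=True, encoding="utf-8")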

How to get an XPath from selenium webelement or from lxml?

Submitted by 余生颓废 on 2019-12-05 05:49:19
I am using Selenium and I need to find the XPaths of some Selenium web elements. For example:

    import selenium.webdriver

    driver = selenium.webdriver.Firefox()
    element = driver.find_element_by_xpath(<some_xpath>)
    elements = element.find_elements_by_xpath(<some_relative_xpath>)
    for e in elements:
        print e.get_xpath()

I know I can't get the XPath from the element itself, but is there a nice way to get it anyway? I tried using lxml to parse the HTML, but it doesn't recognize the XPath <some_xpath> I passed, even though driver.find_element_by_xpath(<some_xpath>) did manage to find that element.
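One workaround (a sketch, not part of Selenium's own API) is to re-parse driver.page_source with lxml and use getpath(), which generates an absolute XPath for any element in the parsed tree. The expression below is only a placeholder, and the lxml tree may differ from the live DOM if JavaScript has modified the page:

    import lxml.html

    # `driver` is the selenium.webdriver.Firefox() instance from above.
    tree = lxml.html.fromstring(driver.page_source)
    root = tree.getroottree()

    for e in tree.xpath("//a"):        # placeholder expression
        print(root.getpath(e))         # e.g. /html/body/div[2]/p/a[3]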

XPath and the lxml parsing library for Python web scraping

Submitted by 情到浓时终转凉″ on 2019-12-05 04:57:26
What is XML

- XML stands for EXtensible Markup Language.
- XML is a markup language, very similar to HTML.
- XML was designed to transport data, not to display data.
- XML tags are not predefined; you define your own.
- XML is designed to be self-descriptive.
- XML is a W3C recommendation.

W3School official documentation: http://www.w3school.com.cn/xml/index.asp

Differences between XML and HTML

- XML (Extensible Markup Language): designed to transport and store data; the focus is the content of the data.
- HTML (HyperText Markup Language): designed to display data, and to display it as well as possible.
- HTML DOM (Document Object Model for HTML): through the HTML DOM you can access every HTML element, together with the text and attributes it contains; you can modify and delete content, and create new elements.

Example XML document:

    <?xml version="1.0" encoding="utf-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author…
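As a quick illustration of where this is heading, a minimal XPath example with lxml against a completed version of the bookstore snippet above (the author value is filled in only for illustration):

    from lxml import etree

    xml = b"""<?xml version="1.0" encoding="utf-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
      </book>
    </bookstore>"""

    root = etree.fromstring(xml)
    print(root.xpath("//book/@category"))            # ['cooking']
    print(root.xpath("//title[@lang='en']/text()"))  # ['Everyday Italian']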

python lxml etree applet information from yahoo

Submitted by 泄露秘密 on 2019-12-05 04:50:43
Yahoo Finance updated their website. I had an lxml/etree script that used to extract the analyst recommendations. The analyst recommendations are still there, but now only as a graphic. You can see an example on this page. The graph called Recommendation Trends in the right-hand column shows the number of analyst reports rating the stock strong buy, buy, hold, underperform, and sell. My guess is that Yahoo will make a few more adjustments to the page over the coming little while, but it got me wondering whether such data is extractable in any reasonable way. I mean, is there a way to get the graphic…
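Charts like this are often drawn from a JSON blob embedded in a <script> tag, so one hedged approach is to pull that JSON out of the raw HTML instead of scraping the markup. The variable name root.App.main and the path down to recommendationTrend below are assumptions about how the Yahoo Finance page was built at the time and may well have changed:

    import json
    import re
    from urllib.request import Request, urlopen

    req = Request("https://finance.yahoo.com/quote/AAPL",
                  headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode("utf-8")

    # Grab the page's embedded state object, if it is still served this way.
    match = re.search(r"root\.App\.main\s*=\s*(\{.*?\});\s*\n", html, re.S)
    if match:
        data = json.loads(match.group(1))
        store = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]
        print(store.get("recommendationTrend"))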

BeautifulSoup: a quick introduction to a handy web-page parsing tool

Submitted by 主宰稳场 on 2019-12-05 04:22:53
We have already covered many scraping cases and techniques, but most of those articles focused on how to fetch the contents of a web page. Today we look at the next step: once the content has been crawled, how do you extract the specific information you need?

A downloaded page is usually just a str string object. The most direct way to find information in it is the string find method plus slicing:

    s = '<p>价格:15.7 元</p>'
    start = s.find('价格:')
    end = s.find(' 元')
    print(s[start + 3:end])  # 15.7

That is enough for extremely simple cases, but anything slightly more complex written this way becomes exhausting. The more general approach is a regular expression:

    import re

    s = '<p>价格:15.7 元</p>'
    r = re.search(r'[\d.]+', s)
    print(r.group())  # 15.7

Regular expressions are the all-purpose tool of text parsing and can handle any situation, but mastering them has a real learning cost: we started with one problem, extracting data from a web page; we used a regular expression; now we have two problems. An HTML document, however, is structured text with definite rules, and that structure can simplify extraction. That is where web-page extraction libraries such as lxml, pyquery and BeautifulSoup come in. We generally use these libraries to extract page information…
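For comparison, a minimal BeautifulSoup sketch for the same snippet (bs4 needs to be installed; lxml is used here as its underlying parser), showing the structure-based lookup the article is leading up to:

    from bs4 import BeautifulSoup

    s = '<p>价格:15.7 元</p>'
    soup = BeautifulSoup(s, 'lxml')
    text = soup.p.get_text()   # '价格:15.7 元'
    print(text[3:-2])          # 15.7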

lxml: insert tag at a given position

Submitted by 半城伤御伤魂 on 2019-12-05 03:34:46
I have an XML file, similar to this:

    <tag attrib1='I'>
      <subtag1 subattrib1='1'>
        <subtext>text1</subtext>
      </subtag1>
      <subtag3 subattrib3='3'>
        <subtext>text3</subtext>
      </subtag3>
    </tag>

I would like to insert a new subelement, so the result would be something like this:

    <tag attrib1='I'>
      <subtag1 subattrib1='1'>
        <subtext>text1</subtext>
      </subtag1>
      <subtag2 subattrib2='2'>
        <subtext>text2</subtext>
      </subtag2>
      <subtag3 subattrib3='3'>
        <subtext>text3</subtext>
      </subtag3>
    </tag>

I can append to my XML file, but then the new elements are inserted at the end. How can I force Python lxml to put it into…
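A minimal sketch of two ways lxml lets you control the position (the file name is a placeholder): Element.insert() takes a 0-based index among the parent's children, and addprevious()/addnext() place the new element relative to an existing sibling:

    from lxml import etree

    tree = etree.parse("data.xml")
    root = tree.getroot()

    new = etree.Element("subtag2", subattrib2="2")
    etree.SubElement(new, "subtext").text = "text2"

    # Option 1: insert at an index (here, as the second child of <tag>).
    root.insert(1, new)

    # Option 2: insert relative to an existing element instead.
    # root.find("subtag3").addprevious(new)

    print(etree.tostring(root).decode())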

Iterate over both text and elements in lxml etree

Submitted by 偶尔善良 on 2019-12-05 03:06:09
Suppose I have the following XML document:

    <species>
      Mammals:
      <dog/>
      <cat/>
      Reptiles:
      <snake/>
      <turtle/>
      Birds:
      <seagull/>
      <owl/>
    </species>

Then I get the species element like this:

    import lxml.etree

    doc = lxml.etree.fromstring(xml)
    species = doc.xpath('/species')[0]

Now I would like to print a list of animals grouped by species. How could I do it using the ElementTree API? If you enumerate all of the nodes, you'll see a text node with the class followed by element nodes with the species:

    >>> for node in species.xpath("child::node()"):
    ...     print type(node), node
    ...
    <class 'lxml.etree.…
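One way to do the grouping with the ElementTree API alone is to lean on .text and .tail, which is where lxml stores the text between elements. A minimal sketch, assuming xml holds the document above:

    import lxml.etree

    doc = lxml.etree.fromstring(xml)
    species = doc.xpath('/species')[0]

    groups = {}
    current = (species.text or "").strip()    # text before the first element
    for child in species:
        groups.setdefault(current, []).append(child.tag)
        tail = (child.tail or "").strip()
        if tail:                              # a new "Reptiles:"-style label
            current = tail

    print(groups)
    # {'Mammals:': ['dog', 'cat'], 'Reptiles:': ['snake', 'turtle'],
    #  'Birds:': ['seagull', 'owl']}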

Encoding error while parsing RSS with lxml

Submitted by 时间秒杀一切 on 2019-12-05 02:34:05
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError:

    request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
    response = urllib2.urlopen(request)
    response = response.read()
    encd = chardet.detect(response)['encoding']
    parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
    tree = etree.parse(response, parser)

But I get an error:

    tree = etree.parse(response, parser)
    File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
    File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c…
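One likely fix, sketched in Python 3 (so urllib.request stands in for urllib2): etree.parse() expects a filename, URL or file-like object, not the downloaded bytes themselves. Handing the raw bytes to fromstring(), or wrapping them in BytesIO, lets libxml2 read the encoding from the XML declaration, so chardet is usually unnecessary:

    from io import BytesIO
    from urllib.request import urlopen
    from lxml import etree

    data = urlopen('http://wiadomosci.onet.pl/kraj/rss.xml').read()
    parser = etree.XMLParser(ns_clean=True, recover=True)
    tree = etree.parse(BytesIO(data), parser)
    print(tree.getroot().tag)   # root element of the feed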

How to debug lxml.etree.XSLTParseError: Invalid expression error

Submitted by 此生再无相见时 on 2019-12-05 00:53:06
Question: I'm trying to find out why lxml cannot parse an XSL document which consists of a "root" document with various xml:includes. I get an error:

    Traceback (most recent call last):
      File "s.py", line 10, in <module>
        xslt = ET.XSLT(ET.parse(d))
      File "xslt.pxi", line 409, in lxml.etree.XSLT.__init__ (src/lxml/lxml.etree.c:151978)
    lxml.etree.XSLTParseError: Invalid expression

That tells me where in the lxml source the error is, but is there a way to get more through lxml about where in the xsl the…
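A hedged sketch of one way to get more detail: lxml exceptions carry an error_log, and its entries record the file name, line and column that libxslt reported, which usually points at the offending expression inside whichever included XSL file it lives in (the path below is a placeholder):

    import lxml.etree as ET

    try:
        xslt = ET.XSLT(ET.parse("root.xsl"))   # placeholder path
    except ET.XSLTParseError as exc:
        # Each log entry carries the file, line and column libxslt reported.
        for entry in exc.error_log:
            print(entry.filename, entry.line, entry.column, entry.message)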