lxml

(still) cannot properly install lxml 2.3 for python, but at least 2.2.8 works

帅比萌擦擦* submitted on 2019-11-30 14:36:26
30 Jun 2011 -- I am awarding @Pablo for this question, because of his answer. I am still unable to properly install lxml 2.3, for reasons discussed in his comments. I gather that with a little more work I could, but I have already spent a ridiculous amount of time on this problem. I have, however, written the code I needed and successfully installed lxml 2.2.8, and the code functions with this version. Better yet, Pablo was the only one to properly diagnose the error: libxslt needed to be updated to a version with support for exsltMathXpathCtxtRegister. I appreciate everyone's help on this.
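For diagnosing this kind of mismatch, lxml reports the library versions it was compiled against and the ones it is actually running with; comparing the two quickly shows whether an old libxslt is being picked up at runtime. A minimal check:

    # Compare the libxslt version lxml was built against with the one
    # loaded at runtime; a mismatch explains missing symbols like
    # exsltMathXpathCtxtRegister.
    from lxml import etree

    print('lxml:            ', etree.LXML_VERSION)
    print('libxslt compiled:', etree.LIBXSLT_COMPILED_VERSION)
    print('libxslt running: ', etree.LIBXSLT_VERSION)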

AWS Lambda not importing LXML

冷暖自知 submitted on 2019-11-30 13:52:19
I am trying to use the lxml module within AWS Lambda and having no luck. I downloaded lxml with the following command: pip install lxml -t folder to pull it into my Lambda function's deployment package. I zipped the contents of my Lambda function up, as I have done with all my other Lambda functions, and uploaded it to AWS Lambda. However, no matter what I try, I get this error when I run the function: Unable to import module 'handler': /var/task/lxml/etree.so: undefined symbol: PyFPE_jbuf When I run it locally I don't have any issues; it is only when I run it on Lambda that this problem arises.

Python crawler practice: quick start from zero (3) — scraping Douban books

偶尔善良 submitted on 2019-11-30 13:34:33
The previous article, "Python crawler practice: quick start from zero (2) — scraping Douban movies", scraped a single page of Douban movie information. What if we want to scrape information from multiple pages? Then that code alone is not enough. Below we scrape the Douban Top 250 books, at the following address: https://book.douban.com/top250 Which pieces of information do we want? As shown in the figure below. 1. Inspect the page and copy the XPath of the title of 《追风筝的人》 (The Kite Runner):

    //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a

Let's try the same approach as before:

    # -*- coding: utf-8 -*-
    import requests
    from lxml import etree
    import time

    url = 'https://book.douban.com/top250'
    data = requests.get(url).text
    f = etree.HTML(data)
    books = f.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a/@title')

What?! Why does it return an empty list??? Note: the XPath copied from the browser is not completely reliable; the browser often inserts extra
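The excerpt breaks off here, but a very common cause of exactly this empty result is the <tbody> element that browsers add when you copy an XPath, even though it is absent from the server's raw HTML. A sketch of the fix, assuming that is the culprit here:

    # Drop the browser-inserted tbody step (and the table index, so every
    # book on the page matches, not just the first table).
    import requests
    from lxml import etree

    data = requests.get('https://book.douban.com/top250').text
    f = etree.HTML(data)
    books = f.xpath('//*[@id="content"]/div/div[1]/div/table//tr/td[2]/div[1]/a/@title')
    print(books)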

Getting started with Python crawlers | 2: Scraping Douban movie information

馋奶兔 submitted on 2019-11-30 13:32:41
This is a free Python crawler course for complete beginners. It has only 7 lessons, gives you a first understanding of crawlers even with zero background, and lets you scrape resources yourself by following along. Read the article, open your computer, and practice; on average a lesson takes 45 minutes, so if you like, you can step through the door into crawling today~ All right, let's officially start our second lesson, "Scraping Douban movie information"! La-la-la, eyes on the blackboard~

1. Crawler principles

1.1 Basic principles of crawlers. We have all heard about crawlers, but what exactly is a crawler, and how does it work? Let's start with the principles. A crawler, also known as a web spider, is a program or script; the key point is that it can automatically fetch web page information according to certain rules. The general framework of a crawler is as follows (a runnable sketch appears after this excerpt):

1. Pick seed URLs;
2. Put these URLs into the queue of URLs to be crawled;
3. Take a URL from the queue, download the page, and store it in the library of downloaded pages; in addition, put the URLs found there into the to-crawl queue and enter the next loop;
4. Analyse the URLs in the crawled queue, put new URLs into the to-crawl queue, and continue the loop.

Ahem~ A concrete example will make this clearer!

1.2 A crawler example. A crawler gets page information by the same principle a person does. Say we want a movie's "rating". Manual steps: 1. open the page with the movie information; 2. locate the rating on the page; 3. copy and save the rating data we want. Crawler steps: 1. request and download the movie page; 2. parse the page and locate the rating; 3. save the rating data. Quite similar, isn't it?

1.3 The basic crawler workflow. Simply put,
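The queue-based framework in 1.1 maps naturally onto a few lines of code. A minimal sketch, where the seed URL and the link filter are illustrative placeholders, not part of the course:

    from collections import deque
    import requests
    from lxml import etree

    seen = set()
    queue = deque(['https://movie.douban.com/top250'])  # 1. seed URL

    while queue:
        url = queue.popleft()                  # 3. take a URL from the queue
        if url in seen:
            continue
        seen.add(url)
        page = requests.get(url).text          # 3. download the page
        root = etree.HTML(page)                # parse it
        for link in root.xpath('//a/@href'):   # 4. collect new URLs
            if link.startswith('https://movie.douban.com/'):
                queue.append(link)             # 2. queue them for crawling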

Problems using etree with Python 3.7

余生长醉 submitted on 2019-11-30 11:53:38
When installing lxml under Python 3.6 and importing it the way many online tutorials show, you will find that etree does not get imported. Solutions:

1. import lxml.html and then set etree = lxml.html.etree; after that you can use etree.
2. Change the lxml version to 4.2.5 and ignore the error!

The article comes from the following link: https://blog.csdn.net/weixin_42670402/article/details/82385716

Source: https://www.cnblogs.com/shaozhihao/p/11582385.html
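A minimal runnable sketch of workaround 1 described above:

    # lxml.html imports etree internally, so etree can be reached through
    # that module even when "from lxml import etree" fails to resolve.
    import lxml.html

    etree = lxml.html.etree
    root = etree.HTML('<p>hello</p>')
    print(etree.tostring(root))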

How to use lxml to find an element by text?

柔情痞子 submitted on 2019-11-30 11:51:11
Question: Assume we have the following HTML:

    <html>
      <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
      </body>
    </html>

How do I make it find the element "a" which contains "TEXT A"? So far I've got:

    root = lxml.html.document_fromstring(the_html_above)
    e = root.find('.//a')

I've tried:

    e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text". Is there any way I can solve this in a similar fashion to what I've
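The excerpt is cut off, but the usual answer is to match on the text content itself, which plain XPath can do. A minimal sketch:

    # Match <a> elements by their text content instead of an attribute.
    import lxml.html

    the_html_above = '<html><body><a href="/1234.html">TEXT A</a></body></html>'
    root = lxml.html.document_fromstring(the_html_above)
    exact = root.xpath('.//a[text()="TEXT A"]')             # exact text match
    partial = root.xpath('.//a[contains(text(), "TEXT")]')  # substring match
    print(exact[0].get('href'))  # -> /1234.html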

XPathEvalError: Unregistered function for matches() in lxml

白昼怎懂夜的黑 submitted on 2019-11-30 10:06:11
I am trying to use the following XPath query in Python:

    from lxml.html.soupparser import fromstring
    root = fromstring(inString)
    nodes = root.xpath(".//p3[matches(.,'ABC')]//preceding::p2//p3")

but it gives me the error:

    nodes = root.xpath(".//p3[matches(.,'ABC')]//preceding::p2//p3")
      File "lxml.etree.pyx", line 1507, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:52198)
      File "xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:152124)
      File "xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:151097)
      File "xpath
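The error arises because matches() is an XPath 2.0 function, while libxml2 (which lxml uses) implements only XPath 1.0. lxml does, however, expose the EXSLT regular-expression extensions, so re:test() can stand in for matches(). A sketch, assuming the same document and query intent as above (inString is the question's input):

    # Register the EXSLT regexp namespace and use re:test() for the
    # regular-expression predicate that matches() would have provided.
    from lxml.html.soupparser import fromstring

    root = fromstring(inString)
    ns = {'re': 'http://exslt.org/regular-expressions'}
    nodes = root.xpath(".//p3[re:test(., 'ABC')]//preceding::p2//p3",
                       namespaces=ns)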

Python lxml XPath problem

末鹿安然 submitted on 2019-11-30 09:16:07
Question: I'm trying to print/save a certain element's HTML from a web page. I've retrieved the requested element's XPath from Firebug, and all I wish is to save this element to a file, but I don't seem to succeed in doing so. (I tried the XPath with and without a /text() at the end.) I would appreciate any help or past experience. 10x, David

    import urllib2, StringIO
    from lxml import etree

    url = 'http://www.tutiempo.net/en/Climate/Londres_Heathrow_Airport/12-2009/37720.htm'
    seite = urllib2.urlopen(url)
    html = seite
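The excerpt is truncated, but assuming the goal is simply "locate the element via the copied XPath and write its markup to a file", etree.tostring() on the matched element does the serialisation. A sketch in the question's Python 2 style; the XPath below is a placeholder, not the one David actually copied from Firebug:

    import urllib2
    from lxml import etree

    url = 'http://www.tutiempo.net/en/Climate/Londres_Heathrow_Airport/12-2009/37720.htm'
    html = urllib2.urlopen(url).read()
    root = etree.HTML(html)
    element = root.xpath('//table[1]')[0]   # placeholder XPath
    with open('element.html', 'w') as out:
        out.write(etree.tostring(element, pretty_print=True))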

Filtering out certain bytes in python

落爺英雄遲暮 submitted on 2019-11-30 08:45:32
I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters. This question, "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes", explains the issue. The solution was to filter out certain bytes, but I'm confused about how to go about doing this. Any help? Edit: sorry if I didn't give enough info about the problem. The string data comes from an external API query, and I have no control over how the data is formatted. John Machin: As
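A minimal sketch of that filtering step: XML 1.0 forbids the control characters below U+0020 other than tab, newline, and carriage return, so stripping just those makes arbitrary text safe to hand to lxml:

    import re

    # C0 control characters XML 1.0 disallows: everything below 0x20
    # except \t (0x09), \n (0x0a) and \r (0x0d).
    _illegal = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

    def xml_safe(text):
        """Drop NULL bytes and other control characters before building XML."""
        return _illegal.sub(u'', text)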

Iteratively parsing HTML (with lxml?)

佐手、 submitted on 2019-11-30 08:40:34
Question: I'm currently trying to iteratively parse a very large HTML document (I know... yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:

    lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop. Is there a way to iteratively parse HTML without choking on syntax errors? At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document,
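Rather than patching out bad lines one at a time, lxml's iterparse() accepts html=True, which switches it to the forgiving HTML parser, so malformed markup such as a redefined attribute no longer raises XMLSyntaxError. A sketch, with a placeholder filename:

    from lxml import etree

    # html=True uses the recovering HTML parser while still streaming
    # events, keeping memory use flat on very large documents.
    for event, elem in etree.iterparse('big.html', events=('end',), html=True):
        # ... process elem here ...
        elem.clear()  # release each element once handled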