lxml

(still) cannot properly install lxml 2.3 for python, but at least 2.2.8 works

帅比萌擦擦* submitted on 2019-11-30 14:36:26
30 Jun 2011 -- I am awarding @Pablo for this question, because of his answer. I am still unable to properly install lxml 2.3, for reasons discussed in his comments. I gather that with a little more work I could, but I have already spent a ridiculous amount of time on this problem. I have, however, written the code I needed and successfully installed lxml 2.2.8, and the code functions with this version. Better yet, Pablo was the only one to properly diagnose the error: libxslt needed to be updated to a version with support for exsltMathXpathCtxtRegister. I appreciate everyone's help on this.
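For diagnosing this kind of mismatch, lxml reports the library versions it was compiled against and the ones it is actually running with; comparing the two quickly shows whether an old libxslt is being picked up at runtime. A minimal check:

    # Compare the libxslt version lxml was built against with the one
    # loaded at runtime; a mismatch explains missing symbols like
    # exsltMathXpathCtxtRegister.
    from lxml import etree

    print('lxml:            ', etree.LXML_VERSION)
    print('libxslt compiled:', etree.LIBXSLT_COMPILED_VERSION)
    print('libxslt running: ', etree.LIBXSLT_VERSION)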

AWS Lambda not importing LXML

冷暖自知 submitted on 2019-11-30 13:52:19
I am trying to use the lxml module within AWS Lambda and having no luck. I downloaded lxml with the following command: pip install lxml -t folder to pull it into my Lambda function's deployment package. I zipped the contents of my Lambda function up, as I have done with all my other Lambda functions, and uploaded it to AWS Lambda. However, no matter what I try, I get this error when I run the function: Unable to import module 'handler': /var/task/lxml/etree.so: undefined symbol: PyFPE_jbuf When I run it locally I don't have any issues; it is only when I run it on Lambda that this problem arises.

Python crawler practice: quick start from zero (3) — scraping Douban books

偶尔善良 submitted on 2019-11-30 13:34:33
The previous article, "Python crawler practice: quick start from zero (2) — scraping Douban movies", scraped a single page of Douban movie information. What if we want to scrape information from multiple pages? Then that code alone is not enough. Below we scrape the Douban Top 250 books, at the following address: https://book.douban.com/top250 Which pieces of information do we want? As shown in the figure below. 1. Inspect the page and copy the XPath of the title of 《追风筝的人》 (The Kite Runner):

    //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a

Let's try the same approach as before:

    # -*- coding: utf-8 -*-
    import requests
    from lxml import etree
    import time

    url = 'https://book.douban.com/top250'
    data = requests.get(url).text
    f = etree.HTML(data)
    books = f.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a/@title')

What?! Why does it return an empty list??? Note: the XPath copied from the browser is not completely reliable; the browser often inserts extra
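The excerpt breaks off here, but a very common cause of exactly this empty result is the <tbody> element that browsers add when you copy an XPath, even though it is absent from the server's raw HTML. A sketch of the fix, assuming that is the culprit here:

    # Drop the browser-inserted tbody step (and the table index, so every
    # book on the page matches, not just the first table).
    import requests
    from lxml import etree

    data = requests.get('https://book.douban.com/top250').text
    f = etree.HTML(data)
    books = f.xpath('//*[@id="content"]/div/div[1]/div/table//tr/td[2]/div[1]/a/@title')
    print(books)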

Getting started with Python crawlers | 2: Scraping Douban movie information

馋奶兔 submitted on 2019-11-30 13:32:41
This is a free Python crawler course for complete beginners. It has only 7 lessons, gives you a first understanding of crawlers even with zero background, and lets you scrape resources yourself by following along. Read the article, open your computer, and practice; on average a lesson takes 45 minutes, so if you like, you can step through the door into crawling today~ All right, let's officially start our second lesson, "Scraping Douban movie information"! La-la-la, eyes on the blackboard~

1. Crawler principles

1.1 Basic principles of crawlers. We have all heard about crawlers, but what exactly is a crawler, and how does it work? Let's start with the principles. A crawler, also known as a web spider, is a program or script; the key point is that it can automatically fetch web page information according to certain rules. The general framework of a crawler is as follows (a runnable sketch appears after this excerpt):

1. Pick seed URLs;
2. Put these URLs into the queue of URLs to be crawled;
3. Take a URL from the queue, download the page, and store it in the library of downloaded pages; in addition, put the URLs found there into the to-crawl queue and enter the next loop;
4. Analyse the URLs in the crawled queue, put new URLs into the to-crawl queue, and continue the loop.

Ahem~ A concrete example will make this clearer!

1.2 A crawler example. A crawler gets page information by the same principle a person does. Say we want a movie's "rating". Manual steps: 1. open the page with the movie information; 2. locate the rating on the page; 3. copy and save the rating data we want. Crawler steps: 1. request and download the movie page; 2. parse the page and locate the rating; 3. save the rating data. Quite similar, isn't it?

1.3 The basic crawler workflow. Simply put,
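The queue-based framework in 1.1 maps naturally onto a few lines of code. A minimal sketch, where the seed URL and the link filter are illustrative placeholders, not part of the course:

    from collections import deque
    import requests
    from lxml import etree

    seen = set()
    queue = deque(['https://movie.douban.com/top250'])  # 1. seed URL

    while queue:
        url = queue.popleft()                  # 3. take a URL from the queue
        if url in seen:
            continue
        seen.add(url)
        page = requests.get(url).text          # 3. download the page
        root = etree.HTML(page)                # parse it
        for link in root.xpath('//a/@href'):   # 4. collect new URLs
            if link.startswith('https://movie.douban.com/'):
                queue.append(link)             # 2. queue them for crawling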

Problems using etree with Python 3.7

余生长醉 submitted on 2019-11-30 11:53:38
When installing lxml under Python 3.6 and importing it the way many online tutorials show, you will find that etree does not get imported. Solutions:

1. import lxml.html and then set etree = lxml.html.etree; after that you can use etree.
2. Change the lxml version to 4.2.5 and ignore the error!

The article comes from the following link: https://blog.csdn.net/weixin_42670402/article/details/82385716

Source: https://www.cnblogs.com/shaozhihao/p/11582385.html
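A minimal runnable sketch of workaround 1 described above:

    # lxml.html imports etree internally, so etree can be reached through
    # that module even when "from lxml import etree" fails to resolve.
    import lxml.html

    etree = lxml.html.etree
    root = etree.HTML('<p>hello</p>')
    print(etree.tostring(root))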

How to use lxml to find an element by text?

柔情痞子 submitted on 2019-11-30 11:51:11
Question: Assume we have the following HTML:

    <html>
      <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
      </body>
    </html>

How do I make it find the element "a" which contains "TEXT A"? So far I've got:

    root = lxml.html.document_fromstring(the_html_above)
    e = root.find('.//a')

I've tried:

    e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text". Is there any way I can solve this in a similar fashion to what I've
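The excerpt is cut off, but the usual answer is to match on the text content itself, which plain XPath can do. A minimal sketch:

    # Match <a> elements by their text content instead of an attribute.
    import lxml.html

    the_html_above = '<html><body><a href="/1234.html">TEXT A</a></body></html>'
    root = lxml.html.document_fromstring(the_html_above)
    exact = root.xpath('.//a[text()="TEXT A"]')             # exact text match
    partial = root.xpath('.//a[contains(text(), "TEXT")]')  # substring match
    print(exact[0].get('href'))  # -> /1234.html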

XPathEvalError: Unregistered function for matches() in lxml

白昼怎懂夜的黑 submitted on 2019-11-30 10:06:11
I am trying to use the following XPath query in Python:

    from lxml.html.soupparser import fromstring
    root = fromstring(inString)
    nodes = root.xpath(".//p3[matches(.,'ABC')]//preceding::p2//p3")

but it gives me the error:

    nodes = root.xpath(".//p3[matches(.,'ABC')]//preceding::p2//p3")
      File "lxml.etree.pyx", line 1507, in lxml.etree._Element.xpath (src\lxml\lxml.etree.c:52198)
      File "xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:152124)
      File "xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:151097)
      File "xpath
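The error arises because matches() is an XPath 2.0 function, while libxml2 (which lxml uses) implements only XPath 1.0. lxml does, however, expose the EXSLT regular-expression extensions, so re:test() can stand in for matches(). A sketch, assuming the same document and query intent as above (inString is the question's input):

    # Register the EXSLT regexp namespace and use re:test() for the
    # regular-expression predicate that matches() would have provided.
    from lxml.html.soupparser import fromstring

    root = fromstring(inString)
    ns = {'re': 'http://exslt.org/regular-expressions'}
    nodes = root.xpath(".//p3[re:test(., 'ABC')]//preceding::p2//p3",
                       namespaces=ns)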

Python lxml XPath problem

末鹿安然 submitted on 2019-11-30 09:16:07
Question: I'm trying to print/save a certain element's HTML from a web page. I've retrieved the requested element's XPath from Firebug, and all I wish is to save this element to a file, but I don't seem to succeed in doing so. (I tried the XPath with and without a /text() at the end.) I would appreciate any help or past experience. 10x, David

    import urllib2, StringIO
    from lxml import etree

    url = 'http://www.tutiempo.net/en/Climate/Londres_Heathrow_Airport/12-2009/37720.htm'
    seite = urllib2.urlopen(url)
    html = seite
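The excerpt is truncated, but assuming the goal is simply "locate the element via the copied XPath and write its markup to a file", etree.tostring() on the matched element does the serialisation. A sketch in the question's Python 2 style; the XPath below is a placeholder, not the one David actually copied from Firebug:

    import urllib2
    from lxml import etree

    url = 'http://www.tutiempo.net/en/Climate/Londres_Heathrow_Airport/12-2009/37720.htm'
    html = urllib2.urlopen(url).read()
    root = etree.HTML(html)
    element = root.xpath('//table[1]')[0]   # placeholder XPath
    with open('element.html', 'w') as out:
        out.write(etree.tostring(element, pretty_print=True))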

Filtering out certain bytes in python

落爺英雄遲暮 submitted on 2019-11-30 08:45:32
I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters. This question, "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes", explains the issue. The solution was to filter out certain bytes, but I'm confused about how to go about doing this. Any help? Edit: sorry if I didn't give enough info about the problem. The string data comes from an external API query, and I have no control over how the data is formatted. John Machin: As
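A minimal sketch of that filtering step: XML 1.0 forbids the control characters below U+0020 other than tab, newline, and carriage return, so stripping just those makes arbitrary text safe to hand to lxml:

    import re

    # C0 control characters XML 1.0 disallows: everything below 0x20
    # except \t (0x09), \n (0x0a) and \r (0x0d).
    _illegal = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

    def xml_safe(text):
        """Drop NULL bytes and other control characters before building XML."""
        return _illegal.sub(u'', text)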

Iteratively parsing HTML (with lxml?)

佐手、 submitted on 2019-11-30 08:40:34
Question: I'm currently trying to iteratively parse a very large HTML document (I know... yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:

    lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop. Is there a way to iteratively parse HTML without choking on syntax errors? At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document,
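Rather than patching out bad lines one at a time, lxml's iterparse() accepts html=True, which switches it to the forgiving HTML parser, so malformed markup such as a redefined attribute no longer raises XMLSyntaxError. A sketch, with a placeholder filename:

    from lxml import etree

    # html=True uses the recovering HTML parser while still streaming
    # events, keeping memory use flat on very large documents.
    for event, elem in etree.iterparse('big.html', events=('end',), html=True):
        # ... process elem here ...
        elem.clear()  # release each element once handled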