lxml

get all the links of HTML using lxml

对着背影说爱祢 submitted on 2019-12-09 18:48:05
Question: I want to find all the URLs and their names from an HTML page using lxml. I can parse the URL and dig this out myself, but is there any easier way to find all the URL links using lxml?

Answer 1:

from lxml.html import parse

dom = parse('http://www.google.com/').getroot()
links = dom.cssselect('a')

Answer 2:

from lxml import etree, cssselect, html

with open("/you/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)
select = cssselect.CSSSelector("a")
links = select(dochtml)  # a CSSSelector is callable on a parsed tree
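To pull out both the address and the visible name of each match, a minimal follow-up sketch using the links list from either answer (attribute and text access are standard lxml element APIs):

for link in links:
    # href attribute plus the full visible text of the anchor
    print(link.get('href'), link.text_content())

lxml.html also ships a dedicated helper, iterlinks(), which yields (element, attribute, link, pos) tuples for every link-like attribute in the document, not just a-element hrefs.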

XHTML namespace issues with cssselect in lxml

﹥>﹥吖頭↗ submitted on 2019-12-09 13:46:24
Question: I have problems using cssselect with XHTML (or XML with a namespace). Although the documentation says how to use namespaces in cssselect, I do not understand it: cssselect namespaces. My input XHTML string:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Teststylesheet</title>
    <style type="text/css">
      /*<![CDATA[*/
      ol{margin:0;padding:0}
      /*]]>*/
    </style>
  </head>
  <body>
  </body>
</html>
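Because the xmlns declaration puts every element in the http://www.w3.org/1999/xhtml default namespace, a bare selector like "title" matches nothing when the document is parsed as XML. The cssselect namespace syntax binds a prefix of your own choosing to that URI and uses prefix|element in the selector. A minimal sketch (the x prefix is an arbitrary choice, not something from the document):

from lxml import etree
from lxml.cssselect import CSSSelector

xhtml = b'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Teststylesheet</title></head><body/></html>'
tree = etree.fromstring(xhtml)

# bind an arbitrary prefix to the XHTML namespace URI
sel = CSSSelector('x|title', namespaces={'x': 'http://www.w3.org/1999/xhtml'})
print(sel(tree)[0].text)  # Teststylesheet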

Using XPath

喜欢而已 submitted on 2019-12-09 12:26:58
XPath: the XML Path Language

XPath overview: XPath is a language for finding information in XML documents, and it provides very concise and clear path-selection expressions.

Common XPath rules (expression, then what it selects):

nodename   selects all child nodes of the named node
/          selects a direct child from the current node
//         selects descendants from the current node
.          selects the current node
..         selects the parent of the current node
@          selects an attribute

Example: //title[@lang='eng'] selects every node named title whose lang attribute has the value eng.

An introductory example, parsing an HTML file directly:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())  # parse the HTML file with an HTML-aware parser
result = etree.tostring(html)
print(result.decode('utf-8'))

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<ul>
<li class="item-O"><a href="linkl.html"
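Once parsed, the same tree object answers XPath queries built from the rules above. A minimal sketch of selecting the links in the truncated test.html output (the file name and structure are taken from that example, so treat them as assumptions):

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# // walks all descendants, @ reads an attribute, text() reads node text
hrefs = html.xpath("//li/a/@href")
texts = html.xpath("//li/a/text()")
print(hrefs, texts)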

Python lxml iterfind w/ namespace but prefix=None

橙三吉。 submitted on 2019-12-09 11:56:44
Question: I want to perform iterfind() for elements which have a namespace but no prefix. I'd like to call iterfind([tagname]) or iterfind([tagname], [namespace dict]); I don't want to have to spell the tag as follows every time: "{%s}tagname" % tree.nsmap[None]

Details: I'm running through an XML response from a Google API. The root node defines several namespaces, including one for which there is no prefix: xmlns="http://www.w3.org/2005/Atom". It looks as though when I try to search through my etree, everything
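The usual way around this is that the prefix in the search path does not have to match any prefix in the document: you can bind a non-empty prefix of your own to the default namespace URI and pass it in the namespaces dict. A minimal sketch against an Atom-style response (the file name and tag names are illustrative assumptions):

from lxml import etree

tree = etree.parse("feed.xml")  # hypothetical Atom response
ns = {"a": "http://www.w3.org/2005/Atom"}  # any prefix works for the default namespace

for entry in tree.iterfind("a:entry", namespaces=ns):
    print(entry.findtext("a:title", namespaces=ns))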

parsing HTML table using python - HTMLparser or lxml

好久不见. submitted on 2019-12-09 09:31:59
Question: I have an HTML page which consists of a table, and I want to fetch all the values in the td and tr cells of that table. I have tried working with BeautifulSoup, but now I want to work with lxml or an HTML parser in Python. I have attached the example. I want to fetch the values as lists of tuples, like: [ [( value of 2050 jan, value of main subject-part1-sub part1-subject1 ), ( value of 2050 feb, value of main subject-part1-sub part1-subject1 ),... ], [( value of 2050 jan, value of main subject-part1-sub part1-subject2 ),
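The attached page is not reproduced here, so the exact pairing of month headers with subject rows cannot be shown, but the generic lxml pattern for walking a table row by row looks like this (the file name and table position are assumptions):

from lxml import html

doc = html.parse("table.html")  # hypothetical input file

rows = []
for tr in doc.xpath("//table[1]//tr"):
    # text_content() flattens any nested markup inside each cell
    cells = tuple(td.text_content().strip() for td in tr.xpath("./td"))
    if cells:
        rows.append(cells)

print(rows)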

using fromstring() with lxml prefixes

﹥>﹥吖頭↗ submitted on 2019-12-09 03:06:27
I have a variable ele. I'm trying to append a child node onto ele that contains a namespace prefix (called style) in its tag. ele seems to be aware of this prefix, since the line

print(ele.nsmap['style'])

outputs urn:oasis:names:tc:opendocument:xmlns:style:1.0. But when I try to run

ele.append(etree.fromstring('<style:style />'))

I get the error lxml.etree.XMLSyntaxError: Namespace prefix style on style is not defined. What am I missing here?

etree.fromstring('<style:style />') throws an error because <style:style /> is a small XML document in its own right, and on its own it is not namespace-well-formed. You have to declare the namespace inside the fragment itself.
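A minimal sketch of two ways out, using a stand-in root element in place of the question's ele (whose nsmap already binds the style prefix):

from lxml import etree

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# stand-in for the question's ele, with the style prefix declared
ele = etree.Element("root", nsmap={"style": STYLE_NS})

# option 1: declare the prefix inside the fragment so it is well-formed on its own
ele.append(etree.fromstring('<style:style xmlns:style="%s"/>' % STYLE_NS))

# option 2: skip parsing and build the child with Clark notation, {uri}localname
etree.SubElement(ele, "{%s}style" % STYLE_NS)

print(etree.tostring(ele, pretty_print=True).decode())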

Scraping weather data from 中国天气网 with Python and saving it to a CSV file

浪子不回头ぞ submitted on 2019-12-08 21:50:53
Python version: python3.7
IDE: PyCharm
URL scraped: http://www.weather.com.cn/weather/101020100.shtml (中国天气网, Shanghai)
Method used: lxml's CSS selectors
For details on how to use lxml, see my other blog post: https://blog.csdn.net/qq_38929220/article/details/83623057
A sample of the final output is shown in the figure.

The scraping approach:
1. Check the site's robots.txt file
2. Inspect the page source to find the content to scrape
3. Write expressions that extract the desired content
4. Write the results to a CSV file

Checking the site's robots.txt file: robots.txt defines the restrictions a site places on crawlers. You can check it by hand by appending robots.txt to the address you want to crawl, for example: http://www.weather.com.cn/robots.txt. It can also be done in code, which you can then reuse for other sites; this is convenient when crawling many pages.

# check the robots.txt restrictions for this url
if rp.can_fetch(user_agent, url):
    throttle.wait(url)  # delay function
    html = download(url, headers, proxy=proxy, num_retries=num_retries)
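The rp object in that fragment behaves like the standard library's robot parser, while throttle and download are helpers defined elsewhere in the post and not shown here. A minimal sketch of setting up such an rp with urllib.robotparser:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.weather.com.cn/robots.txt")
rp.read()

# True if this user agent may fetch the page
print(rp.can_fetch("*", "http://www.weather.com.cn/weather/101020100.shtml"))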

Getting started with a simple Python 3 + Scrapy crawler

一个人想着一个人 submitted on 2019-12-08 21:29:58
Installing Python
1. Download and install the matching version from the official site: https://www.python.org/downloads/release/python-364/. If you use the zip package you also need to configure the environment variables; the installer package is shown in the figure below.
2. After installing, type python at the cmd prompt; if you see output like the figure, the installation succeeded.

Installing Scrapy
1. https://www.lfd.uci.edu/~gohlke/pythonlibs/ hosts pre-built Windows wheels of third-party Python libraries; download the ones matching your Python version. Search for pip, lxml, twisted and scrapy in turn and download the matching builds. Taking lxml as an example: lxml-4.1.1-cp36-cp36m-win_amd64.whl means lxml version 4.1.1 built for 64-bit Python 3.6. If you don't know your Python version, see the previous step.
Install command: pip install lxml (point pip at the downloaded .whl file to install the local build).
Install the others the same way; "successfully" in the output means the install worked.

Installing pywin32
1. After Scrapy installs, you also need pywin32, from: https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/. Just click through the installer.

That covers the preparation; now for a simple example.

Example
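The post's example is cut off in this copy. As a stand-in for the kind of first spider these introductions build, here is a minimal self-contained Scrapy spider (the site and selectors are illustrative assumptions, not from the original post):

import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: crawl one page and yield one item per quote block."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

It can be run without a full project via: scrapy runspider quotes_spider.py -o quotes.json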

Setting 'xml:space' to 'preserve' Python lxml

痴心易碎 submitted on 2019-12-08 17:24:52
Question: I have a text element within an SVG file that I'm generating using lxml. I want to preserve whitespace in this element. I create the text element and then attempt to .set() the xml:space attribute to preserve, but nothing I try seems to work. I'm probably missing something conceptually. Any ideas?

Answer 1: You can do it by explicitly specifying the namespace URI associated with the special xml: prefix (see http://www.w3.org/XML/1998/namespace):

from lxml import etree

root = etree.Element("root")
root.set("{http://www.w3.org/XML/1998/namespace}space", "preserve")
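Because the xml: prefix is predeclared in every XML document, serializing the element turns the Clark-notation attribute back into the familiar form:

print(etree.tostring(root).decode())
# <root xml:space="preserve"/>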

When parsing html why do I need item.text sometimes and item.text_content() others

南楼画角 submitted on 2019-12-08 17:00:45
Question: Still learning lxml. I discovered that sometimes I cannot get the text of an item from a tree using item.text, yet if I use item.text_content() I am good to go. I am not sure I see why yet; any hints would be appreciated. Okay, I am not sure exactly how to provide an example without making you handle a file; here is some code I wrote to try to figure out why I was not getting some text I expected:

theTree = html.fromstring(open(notmatched[0]).read())
text = []
text_content = []
notText = []
hasText = []
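The difference the question is circling: an element's .text holds only the text that appears before its first child element, while .text_content() concatenates all the text in the element and its descendants. A minimal sketch:

from lxml import html

item = html.fromstring("<p>Hello <b>world</b>!</p>")
print(repr(item.text))      # 'Hello '  (text before the first child only)
print(item.text_content())  # Hello world!  (all descendant text, flattened)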