lxml

get all the links of HTML using lxml

对着背影说爱祢 submitted on 2019-12-09 18:48:05
Question: I want to find all the URLs and their names from an HTML page using lxml. I can parse the URL and dig this out myself, but is there any easier way to find all the URL links using lxml?

Answer 1:

from lxml.html import parse

dom = parse('http://www.google.com/').getroot()
links = dom.cssselect('a')

Answer 2:

from lxml import etree, cssselect, html

with open("/you/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)
select = cssselect.CSSSelector("a")
links = select(dochtml)  # a CSSSelector is callable on a parsed tree
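To pull out both the address and the visible name of each match, a minimal follow-up sketch using the links list from either answer (attribute and text access are standard lxml element APIs):

for link in links:
    # href attribute plus the full visible text of the anchor
    print(link.get('href'), link.text_content())

lxml.html also ships a dedicated helper, iterlinks(), which yields (element, attribute, link, pos) tuples for every link-like attribute in the document, not just a-element hrefs.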

XHTML namespace issues with cssselect in lxml

﹥>﹥吖頭↗ submitted on 2019-12-09 13:46:24
Question: I have problems using cssselect with XHTML (or XML with a namespace). Although the documentation says how to use namespaces in cssselect, I do not understand it: cssselect namespaces. My input XHTML string:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Teststylesheet</title>
    <style type="text/css">
      /*<![CDATA[*/
      ol{margin:0;padding:0}
      /*]]>*/
    </style>
  </head>
  <body>
  </body>
</html>
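Because the xmlns declaration puts every element in the http://www.w3.org/1999/xhtml default namespace, a bare selector like "title" matches nothing when the document is parsed as XML. The cssselect namespace syntax binds a prefix of your own choosing to that URI and uses prefix|element in the selector. A minimal sketch (the x prefix is an arbitrary choice, not something from the document):

from lxml import etree
from lxml.cssselect import CSSSelector

xhtml = b'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Teststylesheet</title></head><body/></html>'
tree = etree.fromstring(xhtml)

# bind an arbitrary prefix to the XHTML namespace URI
sel = CSSSelector('x|title', namespaces={'x': 'http://www.w3.org/1999/xhtml'})
print(sel(tree)[0].text)  # Teststylesheet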

Using XPath

喜欢而已 submitted on 2019-12-09 12:26:58
XPath: the XML Path Language

XPath overview: XPath is a language for finding information in XML documents, and it provides very concise and clear path-selection expressions.

Common XPath rules (expression, then what it selects):

nodename   selects all child nodes of the named node
/          selects a direct child from the current node
//         selects descendants from the current node
.          selects the current node
..         selects the parent of the current node
@          selects an attribute

Example: //title[@lang='eng'] selects every node named title whose lang attribute has the value eng.

An introductory example, parsing an HTML file directly:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())  # parse the HTML file with an HTML-aware parser
result = etree.tostring(html)
print(result.decode('utf-8'))

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<ul>
<li class="item-O"><a href="linkl.html"
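Once parsed, the same tree object answers XPath queries built from the rules above. A minimal sketch of selecting the links in the truncated test.html output (the file name and structure are taken from that example, so treat them as assumptions):

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# // walks all descendants, @ reads an attribute, text() reads node text
hrefs = html.xpath("//li/a/@href")
texts = html.xpath("//li/a/text()")
print(hrefs, texts)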

Python lxml iterfind w/ namespace but prefix=None

橙三吉。 submitted on 2019-12-09 11:56:44
Question: I want to perform iterfind() for elements which have a namespace but no prefix. I'd like to call iterfind([tagname]) or iterfind([tagname], [namespace dict]); I don't want to have to spell the tag as follows every time: "{%s}tagname" % tree.nsmap[None]

Details: I'm running through an XML response from a Google API. The root node defines several namespaces, including one for which there is no prefix: xmlns="http://www.w3.org/2005/Atom". It looks as though when I try to search through my etree, everything
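The usual way around this is that the prefix in the search path does not have to match any prefix in the document: you can bind a non-empty prefix of your own to the default namespace URI and pass it in the namespaces dict. A minimal sketch against an Atom-style response (the file name and tag names are illustrative assumptions):

from lxml import etree

tree = etree.parse("feed.xml")  # hypothetical Atom response
ns = {"a": "http://www.w3.org/2005/Atom"}  # any prefix works for the default namespace

for entry in tree.iterfind("a:entry", namespaces=ns):
    print(entry.findtext("a:title", namespaces=ns))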

parsing HTML table using python - HTMLparser or lxml

好久不见. submitted on 2019-12-09 09:31:59
Question: I have an HTML page which consists of a table, and I want to fetch all the values in the td and tr cells of that table. I have tried working with BeautifulSoup, but now I want to work with lxml or an HTML parser in Python. I have attached the example. I want to fetch the values as lists of tuples, like: [ [( value of 2050 jan, value of main subject-part1-sub part1-subject1 ), ( value of 2050 feb, value of main subject-part1-sub part1-subject1 ),... ], [( value of 2050 jan, value of main subject-part1-sub part1-subject2 ),
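The attached page is not reproduced here, so the exact pairing of month headers with subject rows cannot be shown, but the generic lxml pattern for walking a table row by row looks like this (the file name and table position are assumptions):

from lxml import html

doc = html.parse("table.html")  # hypothetical input file

rows = []
for tr in doc.xpath("//table[1]//tr"):
    # text_content() flattens any nested markup inside each cell
    cells = tuple(td.text_content().strip() for td in tr.xpath("./td"))
    if cells:
        rows.append(cells)

print(rows)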

using fromstring() with lxml prefixes

﹥>﹥吖頭↗ submitted on 2019-12-09 03:06:27
I have a variable ele. I'm trying to append a child node onto ele that contains a namespace prefix (called style) in its tag. ele seems to be aware of this prefix, since the line

print(ele.nsmap['style'])

outputs urn:oasis:names:tc:opendocument:xmlns:style:1.0. But when I try to run

ele.append(etree.fromstring('<style:style />'))

I get the error lxml.etree.XMLSyntaxError: Namespace prefix style on style is not defined. What am I missing here?

etree.fromstring('<style:style />') throws an error because <style:style /> is a small XML document in its own right, and on its own it is not namespace-well-formed. You have to declare the namespace inside the fragment itself.
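A minimal sketch of two ways out, using a stand-in root element in place of the question's ele (whose nsmap already binds the style prefix):

from lxml import etree

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# stand-in for the question's ele, with the style prefix declared
ele = etree.Element("root", nsmap={"style": STYLE_NS})

# option 1: declare the prefix inside the fragment so it is well-formed on its own
ele.append(etree.fromstring('<style:style xmlns:style="%s"/>' % STYLE_NS))

# option 2: skip parsing and build the child with Clark notation, {uri}localname
etree.SubElement(ele, "{%s}style" % STYLE_NS)

print(etree.tostring(ele, pretty_print=True).decode())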

Scraping weather data from 中国天气网 with Python and saving it to a CSV file

浪子不回头ぞ submitted on 2019-12-08 21:50:53
Python version: python3.7
IDE: PyCharm
URL scraped: http://www.weather.com.cn/weather/101020100.shtml (中国天气网, Shanghai)
Method used: lxml's CSS selectors
For details on how to use lxml, see my other blog post: https://blog.csdn.net/qq_38929220/article/details/83623057
A sample of the final output is shown in the figure.

The scraping approach:
1. Check the site's robots.txt file
2. Inspect the page source to find the content to scrape
3. Write expressions that extract the desired content
4. Write the results to a CSV file

Checking the site's robots.txt file: robots.txt defines the restrictions a site places on crawlers. You can check it by hand by appending robots.txt to the address you want to crawl, for example: http://www.weather.com.cn/robots.txt. It can also be done in code, which you can then reuse for other sites; this is convenient when crawling many pages.

# check the robots.txt restrictions for this url
if rp.can_fetch(user_agent, url):
    throttle.wait(url)  # delay function
    html = download(url, headers, proxy=proxy, num_retries=num_retries)
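The rp object in that fragment behaves like the standard library's robot parser, while throttle and download are helpers defined elsewhere in the post and not shown here. A minimal sketch of setting up such an rp with urllib.robotparser:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.weather.com.cn/robots.txt")
rp.read()

# True if this user agent may fetch the page
print(rp.can_fetch("*", "http://www.weather.com.cn/weather/101020100.shtml"))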

Getting started with a simple Python 3 + Scrapy crawler

一个人想着一个人 submitted on 2019-12-08 21:29:58
Installing Python
1. Download and install the matching version from the official site: https://www.python.org/downloads/release/python-364/. If you use the zip package you also need to configure the environment variables; the installer package is shown in the figure below.
2. After installing, type python at the cmd prompt; if you see output like the figure, the installation succeeded.

Installing Scrapy
1. https://www.lfd.uci.edu/~gohlke/pythonlibs/ hosts pre-built Windows wheels of third-party Python libraries; download the ones matching your Python version. Search for pip, lxml, twisted and scrapy in turn and download the matching builds. Taking lxml as an example: lxml-4.1.1-cp36-cp36m-win_amd64.whl means lxml version 4.1.1 built for 64-bit Python 3.6. If you don't know your Python version, see the previous step.
Install command: pip install lxml (point pip at the downloaded .whl file to install the local build).
Install the others the same way; "successfully" in the output means the install worked.

Installing pywin32
1. After Scrapy installs, you also need pywin32, from: https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/. Just click through the installer.

That covers the preparation; now for a simple example.

Example
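The post's example is cut off in this copy. As a stand-in for the kind of first spider these introductions build, here is a minimal self-contained Scrapy spider (the site and selectors are illustrative assumptions, not from the original post):

import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: crawl one page and yield one item per quote block."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

It can be run without a full project via: scrapy runspider quotes_spider.py -o quotes.json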

Setting 'xml:space' to 'preserve' Python lxml

痴心易碎 submitted on 2019-12-08 17:24:52
Question: I have a text element within an SVG file that I'm generating using lxml. I want to preserve whitespace in this element. I create the text element and then attempt to .set() the xml:space attribute to preserve, but nothing I try seems to work. I'm probably missing something conceptually. Any ideas?

Answer 1: You can do it by explicitly specifying the namespace URI associated with the special xml: prefix (see http://www.w3.org/XML/1998/namespace):

from lxml import etree

root = etree.Element("root")
root.set("{http://www.w3.org/XML/1998/namespace}space", "preserve")
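Because the xml: prefix is predeclared in every XML document, serializing the element turns the Clark-notation attribute back into the familiar form:

print(etree.tostring(root).decode())
# <root xml:space="preserve"/>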

When parsing html why do I need item.text sometimes and item.text_content() others

南楼画角 submitted on 2019-12-08 17:00:45
Question: Still learning lxml. I discovered that sometimes I cannot get the text of an item from a tree using item.text, yet if I use item.text_content() I am good to go. I am not sure I see why yet; any hints would be appreciated. Okay, I am not sure exactly how to provide an example without making you handle a file; here is some code I wrote to try to figure out why I was not getting some text I expected:

theTree = html.fromstring(open(notmatched[0]).read())
text = []
text_content = []
notText = []
hasText = []
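The difference the question is circling: an element's .text holds only the text that appears before its first child element, while .text_content() concatenates all the text in the element and its descendants. A minimal sketch:

from lxml import html

item = html.fromstring("<p>Hello <b>world</b>!</p>")
print(repr(item.text))      # 'Hello '  (text before the first child only)
print(item.text_content())  # Hello world!  (all descendant text, flattened)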