lxml

How to select following sibling/xml tag using xpath

Submitted by 岁酱吖の on 2019-11-26 15:22:23
I have an HTML file (from Newegg) and their HTML is organized like below. All of the data in their specifications table is 'desc' while the titles of each section are in 'name'. Below are two examples of data from Newegg pages.

<tr> <td class="name">Brand</td> <td class="desc">Intel</td> </tr>
<tr> <td class="name">Series</td> <td class="desc">Core i5</td> </tr>
<tr> <td class="name">Cores</td> <td class="desc">4</td> </tr>
<tr> <td class="name">Socket</td> <td class="desc">LGA 1156</td> </tr>
<tr> <td class="name">Brand</td> <td class="desc">AMD</td> </tr>
<tr> <td class="name">Series</td> <td
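An answer sketch, assuming lxml is installed: XPath's following-sibling axis selects the "desc" cell that comes after a given "name" cell. The table fragment below is a reduced version of the markup quoted above.

```python
# Select the "desc" td that follows a specific "name" td, using
# the XPath following-sibling axis.
from lxml import html

fragment = """
<table>
  <tr><td class="name">Brand</td><td class="desc">Intel</td></tr>
  <tr><td class="name">Series</td><td class="desc">Core i5</td></tr>
  <tr><td class="name">Cores</td><td class="desc">4</td></tr>
</table>
"""
doc = html.fromstring(fragment)
# Find the name cell whose text is "Series", then take its sibling desc cell.
series = doc.xpath(
    '//td[@class="name"][text()="Series"]'
    '/following-sibling::td[@class="desc"]/text()'
)
print(series)  # ['Core i5']
```

The same pattern generalizes: anchor on the `name` cell you want, then walk to the adjacent `desc` cell.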

src/lxml/etree_defs.h:9:31: fatal error: libxml/xmlversion.h: No such file or directory

Submitted by 南楼画角 on 2019-11-26 15:16:21
Question: I am running the following command to install the packages in that file: "pip install -r requirements.txt --download-cache=~/tmp/pip-cache". requirements.txt contains packages like:

# Data formats
# ------------
PIL==1.1.7
# html5lib==0.90
httplib2==0.7.4
lxml==2.3.1
# Documentation
# -------------
Sphinx==1.1
docutils==0.8.1
# Testing
# -------
behave==1.1.0
dingus==0.3.2
django-testscenarios==0.7.2
mechanize==0.2.5
mock==0.7.2
testscenarios==0.2
testtools==0.9.14
wsgi_intercept==0.5.1
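The fatal error means the libxml2 C headers that lxml compiles against are missing. A common fix on Debian/Ubuntu is to install the development packages first (package names vary by distribution and Python version; this is a sketch, not the only correct set):

```shell
# Install the C headers lxml needs to build from source (Debian/Ubuntu names)
sudo apt-get install libxml2-dev libxslt1-dev python-dev
# Then retry the pinned install
pip install lxml==2.3.1
```

On Red Hat-based systems the equivalents are typically libxml2-devel and libxslt-devel.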

Equivalent to InnerHTML when using lxml.html to parse HTML

Submitted by 半腔热情 on 2019-11-26 14:19:45
Question: I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed. I would like to know what the most sensible way in the library is to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag.

<body>
<h1>A title</h1>
<p>Some text</p>
</body>

The innerHTML is therefore:

<h1>A title</h1>
<p>Some text</p>

I can do it using hacks (converting to string

Beautifulsoup模块基础详解

Submitted by 六眼飞鱼酱① on 2019-11-26 14:09:10
The Beautifulsoup module. Official Chinese documentation: Beautifulsoup official Chinese docs.

Introduction: Beautiful Soup is a Python library that extracts data from HTML or XML files. Through your preferred parser, it provides idiomatic ways to navigate, search, and modify a document, and can save you hours or even days of work. You may be looking for the Beautiful Soup 3 documentation; Beautiful Soup 3 is no longer developed, and the official site recommends using Beautiful Soup 4 in current projects and porting to BS4.

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with one of:
$ apt-get install Python-lxml
$ easy_install lxml
$ pip install lxml
Another available parser is the pure-Python html5lib, which parses documents the same way a browser does. It can be installed with one of:
$ apt-get install Python-html5lib
$ easy_install html5lib
$ pip
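A minimal usage example of the setup described above, creating a soup with the lxml parser and reading a tag's text and attributes:

```python
# Parse a small fragment with the recommended lxml parser and
# read back a tag's text and its attribute list.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>Hello</b></p>', 'lxml')
print(soup.p.b.string)  # Hello
print(soup.p['class'])  # ['title']  (class is multi-valued, so a list)
```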

Extracting lxml xpath for html table

Submitted by 烂漫一生 on 2019-11-26 14:08:28
Question: I have an HTML doc similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th>
<th style="text-align:right;">High</th>
<th style="text-align:right;">Low</th>
</tr>
<tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
<td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
<td>A Inc.</td>
<td align="right">45.44</td>
<td align="right">44.26</td
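An answer sketch against a reduced version of the markup above: parse with lxml's HTML parser (which ignores the duplicated xmlns and so avoids namespace-prefixed XPath) and pull each data row out of the quotes table.

```python
# Extract the data rows of the quotes table with lxml + XPath.
from lxml import html

page = """
<div id="Symbols" class="cb">
  <table class="quotes">
    <tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
    <tr class="ro">
      <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
      <td>A Inc.</td><td>45.44</td><td>44.26</td>
    </tr>
  </table>
</div>
"""
doc = html.fromstring(page)
rows = []
# tr[td] skips the header row, which only has th cells
for tr in doc.xpath('//table[@class="quotes"]//tr[td]'):
    rows.append([td.text_content().strip() for td in tr.xpath('./td')])
print(rows)  # [['A', 'A Inc.', '45.44', '44.26']]
```

`text_content()` flattens the anchor inside the first cell, which is usually what you want for table scraping.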

Why is lxml.etree.iterparse() eating up all my memory?

Submitted by 半世苍凉 on 2019-11-26 13:44:54
Question: This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong / how can I process this large file with iterparse()?

import lxml.etree
for event, schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

Answer 1: As
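The usual cause: iterparse still builds the full tree behind the scenes. A common fix (sketched here against a small in-memory document rather than the real file) is to clear each element after handling it and delete already-processed siblings:

```python
# Keep iterparse memory flat: clear each element once handled and
# drop the siblings that were already processed.
import io
from lxml import etree

xml = b"<root>" + b"<schedule><item/></schedule>" * 3 + b"</root>"
count = 0
for event, elem in etree.iterparse(io.BytesIO(xml), tag='schedule'):
    count += 1                         # ... process elem here ...
    elem.clear()                       # free the element's own subtree
    while elem.getprevious() is not None:
        del elem.getparent()[0]        # free already-seen siblings
print(count)  # 3
```

Without the `clear()`/`del` pair, every `<schedule>` element stays attached to the root until the file ends, which is what exhausts memory on a large file.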

Parsing broken XML with lxml.etree.iterparse

Submitted by 江枫思渺然 on 2019-11-26 13:18:46
Question: I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?

# this works, but loads the whole file into memory
parser = lxml.etree
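A sketch of one resolution: in more recent lxml releases, iterparse itself accepts a recover keyword (check your installed version; older releases did not have it). Here the broken input is a document with a mismatched closing tag, standing in for the bad characters described above:

```python
# iterparse with recover=True keeps streaming past malformed markup
# in lxml versions that support the keyword.
import io
from lxml import etree

broken = b"<root><item>a</item><item>b</root>"  # second <item> never closed

texts = []
for event, elem in etree.iterparse(io.BytesIO(broken), tag='item',
                                   recover=True):
    texts.append(elem.text)
print(texts)
```

If your lxml predates the keyword, the classic workaround is to pre-clean the byte stream (e.g. strip invalid characters with a filtering file wrapper) before handing it to iterparse.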

Python pretty XML printer with lxml

Submitted by 流过昼夜 on 2019-11-26 13:17:50
Question: After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True). I have the following XML:

<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
  <testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....

And I use it like this:

tree = etree.parse('original.xml')
root = tree.getroot()
...
# modifications
...
with open(FILE_NAME,
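The usual explanation: pretty_print only indents nodes that have no existing whitespace text between them, so the old file's whitespace must be stripped at parse time. A sketch with an in-memory document standing in for 'original.xml':

```python
# Parse with remove_blank_text=True so pretty_print can re-indent;
# without it, the old whitespace is kept verbatim and indenting is skipped.
import io
from lxml import etree

ugly = b"<root><a><b>text</b></a></root>"
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(io.BytesIO(ugly), parser)
out = etree.tostring(tree, pretty_print=True, encoding='unicode')
print(out)
```

With a real file, the same parser object is passed to `etree.parse('original.xml', parser)` and the result written via `tree.write(FILE_NAME, pretty_print=True)`.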

datawhale爬虫task02

Submitted by 瘦欲@ on 2019-11-26 13:05:37
2.1 Study beautifulsoup. Learn beautifulsoup and use it to extract content. Use beautifulsoup to extract the reply content from the DXY (丁香园) forum.
2.2 Study xpath. Learn xpath and use lxml + xpath to extract content. Use xpath to extract the reply content from the DXY forum.

1. Studying beautifulsoup:
1. Introduction: BeautifulSoup is a Python library for parsing HTML and XML and extracting data from web pages. It automatically converts input documents to Unicode and output documents to UTF-8. Import it with: from bs4 import BeautifulSoup. Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
2. Parsers: the lxml parser is recommended. To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object, e.g.:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
3. Basic usage:
html = """
<html><head><title>The Dormouse's story<
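For the xpath half of the task, a sketch of the lxml + XPath pattern. The reply markup below is invented for illustration; the real DXY forum page uses different class names and would be fetched over the network first.

```python
# lxml + XPath extraction pattern for forum-style replies
# (hypothetical markup; the real page's structure differs).
from lxml import html

page = """
<div id="replies">
  <div class="reply"><span class="user">alice</span><p>First reply</p></div>
  <div class="reply"><span class="user">bob</span><p>Second reply</p></div>
</div>
"""
doc = html.fromstring(page)
replies = [p.text_content() for p in doc.xpath('//div[@class="reply"]/p')]
print(replies)  # ['First reply', 'Second reply']
```

Against the live site, `page` would come from an HTTP response body, and the XPath would be read off the page's actual DOM in browser devtools.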

lxml etree xmlparser remove unwanted namespace

Submitted by 无人久伴 on 2019-11-26 12:23:17
Question: I have an XML doc that I am trying to parse using Etree.lxml

<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
<Body>
<Envelope>

My code is:

path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()

When I try to get dom.getroot() I get:
<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However I only want:
<Element Envelope at 28adacac>
When
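A common answer sketch: ns_clean only removes redundant namespace declarations; to get plain tag names you can strip the namespace from every element after parsing. One well-known recipe (with the document corrected to well-formed XML, since the quoted one has unclosed tags):

```python
# Strip the default namespace from all element tags after parsing,
# then drop the now-unused xmlns declarations.
import io
from lxml import etree

xml = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

root = etree.parse(io.BytesIO(xml)).getroot()
for elem in root.iter():
    if isinstance(elem.tag, str):            # skip comments and PIs
        elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(root)               # remove unused xmlns entries
print(root.tag)  # Envelope
```

After this, plain XPath like `root.find('Header/Version')` works without namespace maps. If you would rather keep the namespace intact, the alternative is to query with it: `root.find('{http://www.example.com/zzz/yyy}Header')`.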