lxml

How to select following sibling/xml tag using xpath

Submitted by 岁酱吖の on 2019-11-26 15:22:23
I have an HTML file (from Newegg) and their HTML is organized like below. All of the data in their specifications table is 'desc' while the titles of each section are in 'name'. Below are two examples of data from Newegg pages.

<tr> <td class="name">Brand</td> <td class="desc">Intel</td> </tr>
<tr> <td class="name">Series</td> <td class="desc">Core i5</td> </tr>
<tr> <td class="name">Cores</td> <td class="desc">4</td> </tr>
<tr> <td class="name">Socket</td> <td class="desc">LGA 1156</td> </tr>
<tr> <td class="name">Brand</td> <td class="desc">AMD</td> </tr>
<tr> <td class="name">Series</td> <td
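An answer sketch, assuming lxml is installed: XPath's following-sibling axis selects the "desc" cell that comes after a given "name" cell. The table fragment below is a reduced version of the markup quoted above.

```python
# Select the "desc" td that follows a specific "name" td, using
# the XPath following-sibling axis.
from lxml import html

fragment = """
<table>
  <tr><td class="name">Brand</td><td class="desc">Intel</td></tr>
  <tr><td class="name">Series</td><td class="desc">Core i5</td></tr>
  <tr><td class="name">Cores</td><td class="desc">4</td></tr>
</table>
"""
doc = html.fromstring(fragment)
# Find the name cell whose text is "Series", then take its sibling desc cell.
series = doc.xpath(
    '//td[@class="name"][text()="Series"]'
    '/following-sibling::td[@class="desc"]/text()'
)
print(series)  # ['Core i5']
```

The same pattern generalizes: anchor on the `name` cell you want, then walk to the adjacent `desc` cell.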

src/lxml/etree_defs.h:9:31: fatal error: libxml/xmlversion.h: No such file or directory

Submitted by 南楼画角 on 2019-11-26 15:16:21
Question: I am running the following command to install the packages in that file: "pip install -r requirements.txt --download-cache=~/tmp/pip-cache". requirements.txt contains packages like:

# Data formats
# ------------
PIL==1.1.7
# html5lib==0.90
httplib2==0.7.4
lxml==2.3.1
# Documentation
# -------------
Sphinx==1.1
docutils==0.8.1
# Testing
# -------
behave==1.1.0
dingus==0.3.2
django-testscenarios==0.7.2
mechanize==0.2.5
mock==0.7.2
testscenarios==0.2
testtools==0.9.14
wsgi_intercept==0.5.1
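The fatal error means the libxml2 C headers that lxml compiles against are missing. A common fix on Debian/Ubuntu is to install the development packages first (package names vary by distribution and Python version; this is a sketch, not the only correct set):

```shell
# Install the C headers lxml needs to build from source (Debian/Ubuntu names)
sudo apt-get install libxml2-dev libxslt1-dev python-dev
# Then retry the pinned install
pip install lxml==2.3.1
```

On Red Hat-based systems the equivalents are typically libxml2-devel and libxslt-devel.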

Equivalent to InnerHTML when using lxml.html to parse HTML

Submitted by 半腔热情 on 2019-11-26 14:19:45
Question: I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed. I would like to know what the most sensible way in the library is to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag.

<body>
<h1>A title</h1>
<p>Some text</p>
</body>

The innerHTML is therefore:

<h1>A title</h1>
<p>Some text</p>

I can do it using hacks (converting to string

Beautifulsoup模块基础详解

Submitted by 六眼飞鱼酱① on 2019-11-26 14:09:10
The Beautifulsoup module. Official Chinese documentation: Beautifulsoup official Chinese docs.

Introduction: Beautiful Soup is a Python library that extracts data from HTML or XML files. Through your preferred parser, it provides idiomatic ways to navigate, search, and modify a document, and can save you hours or even days of work. You may be looking for the Beautiful Soup 3 documentation; Beautiful Soup 3 is no longer developed, and the official site recommends using Beautiful Soup 4 in current projects and porting to BS4.

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with one of:
$ apt-get install Python-lxml
$ easy_install lxml
$ pip install lxml
Another available parser is the pure-Python html5lib, which parses documents the same way a browser does. It can be installed with one of:
$ apt-get install Python-html5lib
$ easy_install html5lib
$ pip
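A minimal usage example of the setup described above, creating a soup with the lxml parser and reading a tag's text and attributes:

```python
# Parse a small fragment with the recommended lxml parser and
# read back a tag's text and its attribute list.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>Hello</b></p>', 'lxml')
print(soup.p.b.string)  # Hello
print(soup.p['class'])  # ['title']  (class is multi-valued, so a list)
```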

Extracting lxml xpath for html table

Submitted by 烂漫一生 on 2019-11-26 14:08:28
Question: I have an HTML doc similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th>
<th style="text-align:right;">High</th>
<th style="text-align:right;">Low</th>
</tr>
<tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
<td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
<td>A Inc.</td>
<td align="right">45.44</td>
<td align="right">44.26</td
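An answer sketch against a reduced version of the markup above: parse with lxml's HTML parser (which ignores the duplicated xmlns and so avoids namespace-prefixed XPath) and pull each data row out of the quotes table.

```python
# Extract the data rows of the quotes table with lxml + XPath.
from lxml import html

page = """
<div id="Symbols" class="cb">
  <table class="quotes">
    <tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
    <tr class="ro">
      <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
      <td>A Inc.</td><td>45.44</td><td>44.26</td>
    </tr>
  </table>
</div>
"""
doc = html.fromstring(page)
rows = []
# tr[td] skips the header row, which only has th cells
for tr in doc.xpath('//table[@class="quotes"]//tr[td]'):
    rows.append([td.text_content().strip() for td in tr.xpath('./td')])
print(rows)  # [['A', 'A Inc.', '45.44', '44.26']]
```

`text_content()` flattens the anchor inside the first cell, which is usually what you want for table scraping.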

Why is lxml.etree.iterparse() eating up all my memory?

Submitted by 半世苍凉 on 2019-11-26 13:44:54
Question: This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong / how can I process this large file with iterparse()?

import lxml.etree
for event, schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

Answer 1: As
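The usual cause: iterparse still builds the full tree behind the scenes. A common fix (sketched here against a small in-memory document rather than the real file) is to clear each element after handling it and delete already-processed siblings:

```python
# Keep iterparse memory flat: clear each element once handled and
# drop the siblings that were already processed.
import io
from lxml import etree

xml = b"<root>" + b"<schedule><item/></schedule>" * 3 + b"</root>"
count = 0
for event, elem in etree.iterparse(io.BytesIO(xml), tag='schedule'):
    count += 1                         # ... process elem here ...
    elem.clear()                       # free the element's own subtree
    while elem.getprevious() is not None:
        del elem.getparent()[0]        # free already-seen siblings
print(count)  # 3
```

Without the `clear()`/`del` pair, every `<schedule>` element stays attached to the root until the file ends, which is what exhausts memory on a large file.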

Parsing broken XML with lxml.etree.iterparse

Submitted by 江枫思渺然 on 2019-11-26 13:18:46
Question: I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?

# this works, but loads the whole file into memory
parser = lxml.etree
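A sketch of one resolution: in more recent lxml releases, iterparse itself accepts a recover keyword (check your installed version; older releases did not have it). Here the broken input is a document with a mismatched closing tag, standing in for the bad characters described above:

```python
# iterparse with recover=True keeps streaming past malformed markup
# in lxml versions that support the keyword.
import io
from lxml import etree

broken = b"<root><item>a</item><item>b</root>"  # second <item> never closed

texts = []
for event, elem in etree.iterparse(io.BytesIO(broken), tag='item',
                                   recover=True):
    texts.append(elem.text)
print(texts)
```

If your lxml predates the keyword, the classic workaround is to pre-clean the byte stream (e.g. strip invalid characters with a filtering file wrapper) before handing it to iterparse.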

Python pretty XML printer with lxml

Submitted by 流过昼夜 on 2019-11-26 13:17:50
Question: After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True). I have the following XML:

<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
  <testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....

And I use it like this:

tree = etree.parse('original.xml')
root = tree.getroot()
...
# modifications
...
with open(FILE_NAME,
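The usual explanation: pretty_print only indents nodes that have no existing whitespace text between them, so the old file's whitespace must be stripped at parse time. A sketch with an in-memory document standing in for 'original.xml':

```python
# Parse with remove_blank_text=True so pretty_print can re-indent;
# without it, the old whitespace is kept verbatim and indenting is skipped.
import io
from lxml import etree

ugly = b"<root><a><b>text</b></a></root>"
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(io.BytesIO(ugly), parser)
out = etree.tostring(tree, pretty_print=True, encoding='unicode')
print(out)
```

With a real file, the same parser object is passed to `etree.parse('original.xml', parser)` and the result written via `tree.write(FILE_NAME, pretty_print=True)`.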

datawhale爬虫task02

Submitted by 瘦欲@ on 2019-11-26 13:05:37
2.1 Study beautifulsoup. Learn beautifulsoup and use it to extract content. Use beautifulsoup to extract the reply content from the DXY (丁香园) forum.
2.2 Study xpath. Learn xpath and use lxml + xpath to extract content. Use xpath to extract the reply content from the DXY forum.

1. Studying beautifulsoup:
1. Introduction: BeautifulSoup is a Python library for parsing HTML and XML and extracting data from web pages. It automatically converts input documents to Unicode and output documents to UTF-8. Import it with: from bs4 import BeautifulSoup. Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
2. Parsers: the lxml parser is recommended. To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object, e.g.:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
3. Basic usage:
html = """
<html><head><title>The Dormouse's story<
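For the xpath half of the task, a sketch of the lxml + XPath pattern. The reply markup below is invented for illustration; the real DXY forum page uses different class names and would be fetched over the network first.

```python
# lxml + XPath extraction pattern for forum-style replies
# (hypothetical markup; the real page's structure differs).
from lxml import html

page = """
<div id="replies">
  <div class="reply"><span class="user">alice</span><p>First reply</p></div>
  <div class="reply"><span class="user">bob</span><p>Second reply</p></div>
</div>
"""
doc = html.fromstring(page)
replies = [p.text_content() for p in doc.xpath('//div[@class="reply"]/p')]
print(replies)  # ['First reply', 'Second reply']
```

Against the live site, `page` would come from an HTTP response body, and the XPath would be read off the page's actual DOM in browser devtools.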

lxml etree xmlparser remove unwanted namespace

Submitted by 无人久伴 on 2019-11-26 12:23:17
Question: I have an XML doc that I am trying to parse using Etree.lxml

<Envelope xmlns="http://www.example.com/zzz/yyy">
<Header>
<Version>1</Version>
</Header>
<Body>
some stuff
<Body>
<Envelope>

My code is:

path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()

When I try to get dom.getroot() I get:
<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However I only want:
<Element Envelope at 28adacac>
When
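A common answer sketch: ns_clean only removes redundant namespace declarations; to get plain tag names you can strip the namespace from every element after parsing. One well-known recipe (with the document corrected to well-formed XML, since the quoted one has unclosed tags):

```python
# Strip the default namespace from all element tags after parsing,
# then drop the now-unused xmlns declarations.
import io
from lxml import etree

xml = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

root = etree.parse(io.BytesIO(xml)).getroot()
for elem in root.iter():
    if isinstance(elem.tag, str):            # skip comments and PIs
        elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(root)               # remove unused xmlns entries
print(root.tag)  # Envelope
```

After this, plain XPath like `root.find('Header/Version')` works without namespace maps. If you would rather keep the namespace intact, the alternative is to query with it: `root.find('{http://www.example.com/zzz/yyy}Header')`.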