lxml

The beautifulsoup module

随声附和 submitted on 2019-12-05 17:18:29
1. Introduction

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer developed; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

#Install Beautiful Soup
pip install beautifulsoup4
pip install lxml

#Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as a number of third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed in any of the following ways:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another available parser is html5lib, a pure-Python implementation that parses HTML the same way a web browser does. It can be installed in any of the following ways:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib

The table below lists the main parsers along with their advantages and disadvantages.
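Once beautifulsoup4 and a parser are installed, usage looks like this minimal sketch (the HTML string here is made up for illustration):

from bs4 import BeautifulSoup

# build a tree with the lxml parser and pull one tag out of it
soup = BeautifulSoup('<html><body><p class="title">hello</p></body></html>', 'lxml')
print(soup.find('p', class_='title').text)   # -> hello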

Beautifulsoup module: the basics in detail

China☆狼群 submitted on 2019-12-05 17:04:24
Beautifulsoup module. Official Chinese documentation: Beautifulsoup官方中文文档

Introduction: Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer developed; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

#Install Beautiful Soup
pip install beautifulsoup4

#Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as a number of third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed in any of the following ways:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another available parser is html5lib, a pure-Python implementation that parses HTML the same way a web browser does. It can be installed in any of the following ways:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
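As a small illustration of the parser choice discussed above, the second argument to BeautifulSoup selects which installed parser builds the tree. A sketch, assuming lxml and html5lib are installed alongside beautifulsoup4:

from bs4 import BeautifulSoup

doc = '<p>Some <b>bad <i>HTML'
for parser in ('html.parser', 'lxml', 'html5lib'):
    # each parser repairs the broken markup in its own way
    soup = BeautifulSoup(doc, parser)
    print(parser, '->', soup.find('b').get_text())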

Fully streaming XML parser

走远了吗. submitted on 2019-12-05 16:52:10
I'm trying to consume the Exchange GetAttachment web service using requests, lxml and base64io. This service returns a base64-encoded file in a SOAP XML HTTP response. The file content is contained in a single line in a single XML element. GetAttachment is just an example, but the problem is more general. I would like to stream the decoded file contents directly to disk without storing the entire contents of the attachment in memory at any point, since an attachment could be several hundred MB. I have tried something like this:

r = requests.post('https://example.com/EWS/Exchange.asmx', data=...,
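One way to keep memory flat is an incremental parser that decodes the payload as it arrives. The sketch below is not the asker's code: it substitutes the standard library's xml.sax for lxml/base64io, and the element-name test ('Content') and the soap_body placeholder are assumptions. Decoded chunks must be cut at 4-character boundaries, since base64 decodes in 4-character units.

import base64
import xml.sax
import requests

class AttachmentHandler(xml.sax.ContentHandler):
    """Write the base64 payload of the Content element to disk as it arrives."""
    def __init__(self, out):
        super().__init__()
        self.out = out
        self.inside = False
        self.tail = ''   # partial base64 quantum carried between chunks

    def startElement(self, name, attrs):
        if name.endswith('Content'):          # assumed name of the payload element
            self.inside = True

    def characters(self, data):
        if self.inside:
            data = self.tail + ''.join(data.split())   # strip any whitespace
            usable = len(data) - len(data) % 4
            self.out.write(base64.b64decode(data[:usable]))
            self.tail = data[usable:]

    def endElement(self, name):
        if name.endswith('Content'):
            if self.tail:                      # leftover is empty for valid base64
                self.out.write(base64.b64decode(self.tail))
            self.inside = False

r = requests.post('https://example.com/EWS/Exchange.asmx',
                  data=soap_body, stream=True)   # soap_body: the request XML, elided above
parser = xml.sax.make_parser()
with open('attachment.bin', 'wb') as out:
    parser.setContentHandler(AttachmentHandler(out))
    for chunk in r.iter_content(chunk_size=64 * 1024):
        parser.feed(chunk)
    parser.close()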

How to get the full contents of a node using xpath & lxml?

五迷三道 submitted on 2019-12-05 16:44:35
I am using lxml's xpath function to retrieve parts of a webpage. I am trying to get the contents of a <font> tag, which includes HTML tags of its own. If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]

I get the right number of nodes, but they are returned as lxml objects (<Element font at 0x101fe5eb0>). If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()

I get exactly what I want, except that I don't get any of the HTML code which is contained within the <font> nodes. If I use //td[@valign="top"]/p[1]/font[
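Serializing the matched elements, rather than asking for their /text(), keeps the inner markup. A minimal sketch, assuming the page's HTML is already in a string named page_source (a made-up name):

from lxml import etree, html

tree = html.fromstring(page_source)
nodes = tree.xpath('//td[@valign="top"]/p[1]/font'
                   '[@face="verdana" and @color="#ffffff" and @size="2"]')
for node in nodes:
    # tostring() serializes the element together with the child tags that /text() drops
    print(etree.tostring(node, encoding='unicode'))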

How to debug Python memory fault?

老子叫甜甜 submitted on 2019-12-05 16:39:58
Question: Edit: Really appreciate help in finding the bug - but since it might prove hard to find/reproduce, any general debugging help would be greatly appreciated too! Help me help myself! =) Edit 2: Narrowing it down, commenting out code. Edit 3: Seems lxml might not be the culprit, thanks! The full script is here. I need to go over it looking for references. What do they look like? Edit 4: Actually, the script stops (goes to 100%) in this, the parse_og part of it. So edit 3 is false - it must be lxml somehow
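For this kind of narrowing-down, the standard library can report which allocation sites grew across a call. A general-purpose sketch, not taken from the script above; parse_og(some_input) is a hypothetical stand-in for the suspect call, and memory held inside C extensions such as libxml2 may not show up here:

import gc
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

parse_og(some_input)   # hypothetical stand-in for the suspect code path

gc.collect()           # collect reference cycles first, so only real retention remains
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)        # the ten allocation sites that grew the most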

Iterate through all the rows in a table using python lxml xpath

让人想犯罪 __ submitted on 2019-12-05 16:04:06
This is the source code of the html page I want to extract data from. Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168 (the table is at the bottom of the page).

<html>
<table class="clCommonGrid" cellspacing="0">
  <thead>
    <tr>
      <td colspan="3">Kommande matcher</td>
    </tr>
    <tr>
      <th style="width:1%;">Tid</th>
      <th style="width:69%;">Match</th>
      <th style="width:30%;">Arena</th>
    </tr>
  </thead>
  <tbody class="clGrid">
    <tr class="clTrOdd">
      <td nowrap="nowrap" class="no-line-through">
        <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>
      </td>
      <td><a href="?scr=result&fmid
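A sketch of the row iteration, assuming the page content has been fetched into a string named page (a made-up name):

from lxml import html

tree = html.fromstring(page)
# one pass per <tr> inside the fixtures table's body
for row in tree.xpath('//table[@class="clCommonGrid"]/tbody/tr'):
    cells = [' '.join(td.text_content().split()) for td in row.xpath('./td')]
    print(cells)   # e.g. ['2014-09-26 19:30', '<match>', '<arena>']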

lxml/Python: get previous-sibling

眉间皱痕 submitted on 2019-12-05 14:27:46
I have the following html:

<div id="big">
  <span>header 1</span>
  <ul id="outer">
    <li id="inner">aaa</li>
    <li id="inner">bbb</li>
  </ul>
  <span>header 2</span>
  <ul id="outer">
    <li id="inner">ccc</li>
    <li id="inner">ddd</li>
  </ul>
</div>

I want to loop over it in this order: header 1, aaa, bbb, header 2, ccc, ddd. I have tried looping through each ul and then printing the header and the li values. However, I don't know how to get the span header that is associated with a ul.

sets = tree.xpath("//div[@id='big']//ul[@id='outer']")
for set in sets:
    # Print header. Not sure how to get it
    header =
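The preceding-sibling axis does this: for each <ul>, the header is its nearest <span> sibling before it in the document. A minimal sketch, assuming the HTML above is in a string named doc:

from lxml import html

tree = html.fromstring(doc)
for ul in tree.xpath("//div[@id='big']/ul[@id='outer']"):
    # preceding-sibling is a reverse axis, so [1] is the nearest earlier <span>
    print(ul.xpath("preceding-sibling::span[1]/text()")[0])
    for li in ul.xpath("./li"):
        print(li.text)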

xpath: How do we select just the very last text node?

久未见 submitted on 2019-12-05 13:19:29
How do I select the globally-last text node using xpath? I tried this, but it gives me the last node in every context of the document:

>>> lxml.html.fromstring('1<a>2<b>3</b>4<c>5</c>6</a>').xpath('//text()[last()]')
['1', '3', '5', '6']

I can do this, but it's inefficient in both time and space, especially as the document gets large:

>>> lxml.html.fromstring('1<a>2<b>3</b>4<c>5</c>6</a>').xpath('//text()[last()]')[-1]
'6'

I tried to use an index of -1, but that gives me an empty list. I tried to use some of the reverse axes (so that I could index with 1), but I couldn't get them to work in a global
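Parentheses change what last() applies to: in //text()[last()] the predicate is evaluated once per context node, while in (//text())[last()] it indexes the complete node-set in document order. A minimal sketch:

import lxml.html

tree = lxml.html.fromstring('1<a>2<b>3</b>4<c>5</c>6</a>')
# the parentheses make last() index the full node-set, not each context
print(tree.xpath('(//text())[last()]'))   # ['6']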

Python 3.4 lxml.etree: Start tag expected, '<' not found, line 1, column 1

浪尽此生 submitted on 2019-12-05 13:03:41
Friends, as a novice at best, I have not been able to figure this out from what is available in the forums. Ultimately, all I want to do is take some simple XML files and convert them all to CSV in one go (though this code is just for one at a time). It looks to me like there are no official namespaces, but I'm not sure. I have this code (I used one header, 'SubmittingSystemVendor', but I really want to write all of them to CSV):

import csv
import lxml.etree

x = r'C:\Users\...\jh944.xml'
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    # writerow expects a sequence; a bare string would be split into characters
    writer.writerow(['SubmittingSystemVendor'])
    root =
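A sketch of the whole conversion; since the schema of jh944.xml isn't shown, the repeating element name 'Record' below is an assumption:

import csv
import lxml.etree

root = lxml.etree.parse(r'C:\Users\...\jh944.xml').getroot()   # path elided as above
records = root.findall('.//Record')            # hypothetical repeating element
headers = [child.tag for child in records[0]]  # every column, not just one

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for rec in records:
        writer.writerow([rec.findtext(tag, default='') for tag in headers])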

Using python-amazon-product-api on Google Appengine without lxml [duplicate]

筅森魡賤 submitted on 2019-12-05 13:03:15
This question already has answers here: Closed 7 years ago. Possible Duplicate: Amazon API library for Python? I want to use the python-amazon-product-api wrapper to access the Amazon API: http://pypi.python.org/pypi/python-amazon-product-api/ Unfortunately it relies on lxml, which is not supported on Google App Engine. Does anyone know a workaround? I'm only looking to do basic things with the API, so could I use ElementTree instead? I'm a newbie, so using anything other than how it comes out of the box is still a challenge :) Thanks, Tom

You could try to use this fork. This is a minor fork
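For basic use, the standard library's ElementTree (which App Engine does bundle) can parse API responses directly. A sketch only; response_xml and the 'Item'/'ASIN' element names are placeholders, not the wrapper's real schema:

import xml.etree.ElementTree as ET

root = ET.fromstring(response_xml)   # response_xml: the raw API response, fetched elsewhere
for item in root.iter('Item'):       # assumed element names, for illustration
    print(item.findtext('ASIN'))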