lxml

Python 2.6 lxml.etree: output single quotes for attributes instead of double quotes

Submitted by 不想你离开。 on 2019-12-04 15:39:55
I have the following code:

```python
#!/usr/bin/python2.6
from lxml import etree

n = etree.Element('test')
n.set('id', '1234')
print etree.tostring(n)
```

The output generated is `<test id="1234"/>`, but I want `<test id='1234'/>`. Can someone help?

I checked the documentation and found no reference to a single/double-quote option. I think your only recourse is:

```python
print etree.tostring(n).replace('"', "'")
```

Update. Given:

```python
from lxml import etree

n = etree.Element('test')
n.set('id', "Zach's not-so-good answer")
```

my original answer could output malformed XML because of unbalanced apostrophes: `<test id='Zach's not-so-good answer'/>`.
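Since lxml's serializer exposes no quote-style option, any fix has to happen on the serialized string. A slightly safer sketch than a blanket `replace` (a hypothetical helper, still fragile on exotic markup, so treat it as a workaround rather than a real serializer):

```python
import re
from lxml import etree

def single_quote_attrs(xml_bytes):
    """Swap attribute delimiters to single quotes, escaping any
    apostrophes inside the values first so the XML stays well-formed."""
    s = xml_bytes.decode('utf-8')
    def swap(m):
        return "%s='%s'" % (m.group(1), m.group(2).replace("'", '&apos;'))
    # match name="value" pairs and rewrite each one
    return re.sub(r'([\w:-]+)="([^"]*)"', swap, s)

n = etree.Element('test')
n.set('id', "Zach's answer")
print(single_quote_attrs(etree.tostring(n)))
# <test id='Zach&apos;s answer'/>
```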

lxml cssselect Parsing

Submitted by 家住魔仙堡 on 2019-12-04 15:01:41
I have a document with the following data:

```html
<div class="ds-list">
  <b>1. </b>
  A domesticated carnivorous mammal <i>(Canis familiaris)</i>
  related to the foxes and wolves and raised in a wide variety of breeds.
</div>
```

I want to get everything within the class `ds-list` (without the `<b>` and `<i>` tags). Currently my code is `doc.cssselect('div.ds-list')`, but all this picks up is the newline before the `<b>`. How can I get this to do what I want?

Perhaps you are looking for the `text_content` method?

```python
import lxml.html as lh

content = '''\
<div class="ds-list">
<b>1. </b>
A domesticated carnivorous mammal <i>(Canis familiaris)</i>
related to the foxes and wolves and raised in a wide variety of breeds.
</div>'''
```

Scraping new ESPN site using xpath [Python]

Submitted by 核能气质少年 on 2019-12-04 14:38:17
Question: I am trying to scrape the new ESPN NBA scoreboard. Here is a simple script which should return the start times for all games on 4/5/15:

```python
import requests
import lxml.html
from lxml.cssselect import CSSSelector

doc = lxml.html.fromstring(
    requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405').text)

# xpath
print doc.xpath("//title/text()")  # print page title
print doc.xpath("//span/@time")
print doc.xpath("//span[@class='time']")
print doc.xpath("//span[@class='time']/text()")
# CSS
```
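The XPath expressions themselves look fine; the usual culprit with the newer ESPN pages is that the scoreboard markup is built by JavaScript, so those `span` elements never appear in the raw HTML that `requests` receives. Against static markup the same query behaves as expected; a minimal sketch with hypothetical markup standing in for the rendered scoreboard:

```python
import lxml.html

# Hypothetical static markup resembling what the browser eventually renders:
snippet = ('<div><span class="time">7:00 PM ET</span>'
           '<span class="time">9:30 PM ET</span></div>')
doc = lxml.html.fromstring(snippet)
print(doc.xpath("//span[@class='time']/text()"))
# ['7:00 PM ET', '9:30 PM ET']
```

When the list comes back empty on the live page, comparing `page.text` with the browser's rendered DOM (view-source vs. inspector) usually confirms the markup is script-generated.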

import error due to bs4 vs BeautifulSoup

Submitted by 时光怂恿深爱的人放手 on 2019-12-04 14:13:11
Question: I am trying to use the BeautifulSoup-compatible lxml parser and it is giving me an error:

```python
from lxml.html.soupparser import fromstring
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/soupparser.py", line 7, in <module>
    from BeautifulSoup import \
ImportError: No module named BeautifulSoup
```

I have bs4 installed. How do I fix this issue?

Answer 1: The error is caused by soupparser.py trying to import BeautifulSoup version 3 while you have version 4 (bs4) installed.
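In other words, the module name changed between generations: Beautiful Soup 3 installs as `BeautifulSoup`, Beautiful Soup 4 as `bs4`, and old lxml releases only know the former (newer lxml imports `bs4` directly, so upgrading lxml is one fix). A small diagnostic sketch for checking which generation the current environment can satisfy:

```python
import importlib.util

def bs_generation():
    """Report which BeautifulSoup generation is importable, if any."""
    if importlib.util.find_spec("BeautifulSoup") is not None:
        return 3   # legacy module name that old soupparser.py expects
    if importlib.util.find_spec("bs4") is not None:
        return 4   # only bs4 present: upgrade lxml so it imports bs4
    return None

print(bs_generation())
```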

How to install lxml for PyPy?

Submitted by 亡梦爱人 on 2019-12-04 13:51:00
I've created a virtualenv for PyPy with:

```shell
virtualenv test -p `which pypy`
source test/bin/activate
```

I installed the following dependencies:

```shell
sudo apt-get install python-dev libxml2 libxml2-dev libxslt-dev
```

And then I ran:

```shell
pip install --upgrade lxml
```

As a result I get a lot of errors looking like this:

```
src/lxml/lxml.etree.c:234038:22: error: 'PyThreadState' {aka 'struct _ts'} has no member named 'use_tracing'
```

How do I properly install lxml for PyPy 2.6.0?

I used the following fork of lxml for PyPy instead: https://github.com/aglyzov/lxml/tree/cffi

It can be installed with:

```shell
pip install -e git+git:/
```

Problem using py2app with the lxml package

Submitted by 廉价感情. on 2019-12-04 13:14:36
Question: I am trying to use py2app to generate a standalone application from some Python scripts. The Python uses the lxml package, and I've found that I have to specify this explicitly in the setup.py file that py2app uses. However, the resulting application still won't run on machines that haven't had lxml installed. My setup.py looks like this:

```python
from setuptools import setup

OPTIONS = {'argv_emulation': True, 'packages': ['lxml']}

setup(app=['MyApp.py'],
      data_files=[],
      options={'py2app': OPTIONS})
```

How to scrape this webpage with Python and lxml? Empty list returned

Submitted by 十年热恋 on 2019-12-04 12:11:59
Question: For educational purposes, I'm trying to scrape this page gradually with Python and lxml, starting with movie names. From what I've read so far in the Python docs on lxml and the W3Schools pages on XPath, this code should yield all the movie titles in a list:

```python
from lxml import html
import requests

page = requests.get('http://www.rottentomatoes.com/browse/dvd-top-rentals/')
tree = html.fromstring(page.text)
movies = tree.xpath('//h3[@class="movieTitle"]/text()')
print movies
```

Basically, it should give me a list of the movie titles, but it returns an empty list.
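Besides JavaScript-rendered markup, one common cause of an empty list here is that `@class="movieTitle"` matches only the *exact* attribute value: if the real element carries extra classes, the predicate fails. The standard XPath 1.0 workaround is the `contains`/`concat` idiom; a sketch with hypothetical markup:

```python
import lxml.html

# Hypothetical markup: the h3 carries a second class alongside movieTitle
doc = lxml.html.fromstring('<h3 class="movieTitle bold">Example Film</h3>')

print(doc.xpath('//h3[@class="movieTitle"]/text()'))
# [] -- exact-match predicate misses "movieTitle bold"

print(doc.xpath('//h3[contains(concat(" ", normalize-space(@class), " "),'
                ' " movieTitle ")]/text()'))
# ['Example Film'] -- token-aware match succeeds
```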

Get all the links in an HTML page using lxml

Submitted by 丶灬走出姿态 on 2019-12-04 12:10:01
I want to find all the URLs and their names in an HTML page using lxml. I can parse the page and dig these out myself, but is there an easy way to find all the URL links using lxml?

One way:

```python
from lxml.html import parse

dom = parse('http://www.google.com/').getroot()
links = dom.cssselect('a')
```

Or, reading from a local file:

```python
from lxml import cssselect, html

with open("/your/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)
select = cssselect.CSSSelector("a")
links = [el.get('href') for el in select(dochtml)]
for n, l in enumerate(links):
    print n, l
```
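lxml.html also ships a dedicated helper for this, `iterlinks()`, which yields `(element, attribute, link, pos)` tuples for every link-bearing attribute in the document (not just `<a href>`, so filter as needed). A minimal sketch that pairs each anchor's text with its URL:

```python
import lxml.html

doc = lxml.html.fromstring(
    '<p><a href="/a">first</a> and <a href="/b">second</a></p>')

# Keep only anchor href links, pairing the link text with the URL
pairs = [(el.text_content(), link)
         for el, attr, link, pos in doc.iterlinks()
         if el.tag == 'a' and attr == 'href']
print(pairs)
# [('first', '/a'), ('second', '/b')]
```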

Detailed guide to basic usage of the Beautifulsoup module

Submitted by 爷，独闯天下 on 2019-12-04 12:05:50
Contents: the Beautifulsoup module; official Chinese documentation; introduction; basic usage; traversing the document tree; searching the document tree; the five kinds of filters; find_all(name, attrs, recursive, text, **kwargs); find(name, attrs, recursive, text, **kwargs); other methods; CSS selectors; summary

The Beautifulsoup module

Official Chinese documentation: Beautifulsoup official Chinese documentation

Introduction: Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with your parser of choice, it provides idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer under development; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

```shell
# install Beautiful Soup
pip install beautifulsoup4

# install a parser: Beautiful Soup supports the HTML parser in the
# Python standard library as well as several third-party parsers, one
# of which is lxml. Depending on your operating system, lxml can be
# installed with, for example:
apt-get install python-lxml
```
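A minimal basic-usage sketch (assumes `beautifulsoup4` is installed; it uses the stdlib `html.parser` backend so no third-party parser is required):

```python
from bs4 import BeautifulSoup

# Parse a small fragment with the standard-library parser
soup = BeautifulSoup('<p class="title"><b>Hello</b> world</p>', 'html.parser')

# find() locates the first matching tag; get_text() strips the markup
tag = soup.find('p', class_='title')
print(tag.get_text())
# Hello world
```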

Iteratively parse a large XML file without using the DOM approach

Submitted by 霸气de小男生 on 2019-12-04 11:50:45
I have an XML file:

```xml
<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  ...
  <email id="998349883487454359203" Body="hi"/>
</temp>
```

I want to read the file one email tag at a time: read email id=1 and extract its Body, then read email id=2 and extract its Body, and so on. I tried to do this using the DOM model for XML parsing, but since my file size is 100 GB the approach does not work. I then tried:

```python
from xml.etree import ElementTree as ET

tree = ET.parse('myfile.xml')
root = tree.getroot()
for i in root.findall('email'):
    print i.get('Body')
```
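`ET.parse` still builds the whole tree in memory, which is what fails at 100 GB. The usual streaming answer is `iterparse`, which visits each `<email>` as its end tag is read; clearing the element afterwards keeps memory flat. A minimal sketch on an in-memory stand-in for the file:

```python
import io
from xml.etree.ElementTree import iterparse

# Stand-in for open('myfile.xml', 'rb'); iterparse accepts any file-like object
xml = io.BytesIO(b'<temp><email id="1" Body="abc"/><email id="2" Body="fre"/></temp>')

bodies = []
for event, elem in iterparse(xml, events=('end',)):
    if elem.tag == 'email':
        bodies.append(elem.get('Body'))
        elem.clear()   # drop the processed element so the tree stays small
print(bodies)
# ['abc', 'fre']
```

lxml provides a compatible `lxml.etree.iterparse` with the same pattern, plus a `tag='email'` filter argument.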