lxml

Python 2.6 lxml.etree: output single quotes for attributes instead of double quotes

Submitted by 不想你离开。 on 2019-12-04 15:39:55
I have the following code:

```python
#!/usr/bin/python2.6
from lxml import etree

n = etree.Element('test')
n.set('id', '1234')
print etree.tostring(n)
```

The output generated is `<test id="1234"/>`, but I want `<test id='1234'/>`. Can someone help?

I checked the documentation and found no reference to a single/double-quote option. I think your only recourse is:

```python
print etree.tostring(n).replace('"', "'")
```

Update. Given:

```python
from lxml import etree

n = etree.Element('test')
n.set('id', "Zach's not-so-good answer")
```

my original answer could output malformed XML because of unbalanced apostrophes: `<test id='Zach's not-so-good answer'/>`.
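Since lxml's serializer exposes no quote-style option, any fix has to happen on the serialized string. A slightly safer sketch than a blanket `replace` (a hypothetical helper, still fragile on exotic markup, so treat it as a workaround rather than a real serializer):

```python
import re
from lxml import etree

def single_quote_attrs(xml_bytes):
    """Swap attribute delimiters to single quotes, escaping any
    apostrophes inside the values first so the XML stays well-formed."""
    s = xml_bytes.decode('utf-8')
    def swap(m):
        return "%s='%s'" % (m.group(1), m.group(2).replace("'", '&apos;'))
    # match name="value" pairs and rewrite each one
    return re.sub(r'([\w:-]+)="([^"]*)"', swap, s)

n = etree.Element('test')
n.set('id', "Zach's answer")
print(single_quote_attrs(etree.tostring(n)))
# <test id='Zach&apos;s answer'/>
```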

lxml cssselect Parsing

Submitted by 家住魔仙堡 on 2019-12-04 15:01:41
I have a document with the following data:

```html
<div class="ds-list">
  <b>1. </b>
  A domesticated carnivorous mammal <i>(Canis familiaris)</i>
  related to the foxes and wolves and raised in a wide variety of breeds.
</div>
```

I want to get everything within the class `ds-list` (without the `<b>` and `<i>` tags). Currently my code is `doc.cssselect('div.ds-list')`, but all this picks up is the newline before the `<b>`. How can I get this to do what I want?

Perhaps you are looking for the `text_content` method?

```python
import lxml.html as lh

content = '''\
<div class="ds-list">
<b>1. </b>
A domesticated carnivorous mammal <i>(Canis familiaris)</i>
related to the foxes and wolves and raised in a wide variety of breeds.
</div>'''
```

Scraping new ESPN site using xpath [Python]

Submitted by 核能气质少年 on 2019-12-04 14:38:17
Question: I am trying to scrape the new ESPN NBA scoreboard. Here is a simple script which should return the start times for all games on 4/5/15:

```python
import requests
import lxml.html
from lxml.cssselect import CSSSelector

doc = lxml.html.fromstring(
    requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405').text)

# xpath
print doc.xpath("//title/text()")  # print page title
print doc.xpath("//span/@time")
print doc.xpath("//span[@class='time']")
print doc.xpath("//span[@class='time']/text()")
# CSS
```
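The XPath expressions themselves look fine; the usual culprit with the newer ESPN pages is that the scoreboard markup is built by JavaScript, so those `span` elements never appear in the raw HTML that `requests` receives. Against static markup the same query behaves as expected; a minimal sketch with hypothetical markup standing in for the rendered scoreboard:

```python
import lxml.html

# Hypothetical static markup resembling what the browser eventually renders:
snippet = ('<div><span class="time">7:00 PM ET</span>'
           '<span class="time">9:30 PM ET</span></div>')
doc = lxml.html.fromstring(snippet)
print(doc.xpath("//span[@class='time']/text()"))
# ['7:00 PM ET', '9:30 PM ET']
```

When the list comes back empty on the live page, comparing `page.text` with the browser's rendered DOM (view-source vs. inspector) usually confirms the markup is script-generated.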

import error due to bs4 vs BeautifulSoup

Submitted by 时光怂恿深爱的人放手 on 2019-12-04 14:13:11
Question: I am trying to use the BeautifulSoup-compatible lxml parser and it is giving me an error:

```python
from lxml.html.soupparser import fromstring
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/soupparser.py", line 7, in <module>
    from BeautifulSoup import \
ImportError: No module named BeautifulSoup
```

I have bs4 installed. How do I fix this issue?

Answer 1: The error is caused by soupparser.py trying to import BeautifulSoup version 3 while you have version 4 (bs4) installed.
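In other words, the module name changed between generations: Beautiful Soup 3 installs as `BeautifulSoup`, Beautiful Soup 4 as `bs4`, and old lxml releases only know the former (newer lxml imports `bs4` directly, so upgrading lxml is one fix). A small diagnostic sketch for checking which generation the current environment can satisfy:

```python
import importlib.util

def bs_generation():
    """Report which BeautifulSoup generation is importable, if any."""
    if importlib.util.find_spec("BeautifulSoup") is not None:
        return 3   # legacy module name that old soupparser.py expects
    if importlib.util.find_spec("bs4") is not None:
        return 4   # only bs4 present: upgrade lxml so it imports bs4
    return None

print(bs_generation())
```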

How to install lxml for PyPy?

Submitted by 亡梦爱人 on 2019-12-04 13:51:00
I've created a virtualenv for PyPy with:

```shell
virtualenv test -p `which pypy`
source test/bin/activate
```

I installed the following dependencies:

```shell
sudo apt-get install python-dev libxml2 libxml2-dev libxslt-dev
```

And then I ran:

```shell
pip install --upgrade lxml
```

As a result I get a lot of errors looking like this:

```
src/lxml/lxml.etree.c:234038:22: error: 'PyThreadState' {aka 'struct _ts'} has no member named 'use_tracing'
```

How do I properly install lxml for PyPy 2.6.0?

I used the following fork of lxml for PyPy instead: https://github.com/aglyzov/lxml/tree/cffi

It can be installed with:

```shell
pip install -e git+git:/
```

Problem using py2app with the lxml package

Submitted by 廉价感情. on 2019-12-04 13:14:36
Question: I am trying to use py2app to generate a standalone application from some Python scripts. The Python uses the lxml package, and I've found that I have to specify this explicitly in the setup.py file that py2app uses. However, the resulting application still won't run on machines that haven't had lxml installed. My setup.py looks like this:

```python
from setuptools import setup

OPTIONS = {'argv_emulation': True, 'packages': ['lxml']}

setup(app=['MyApp.py'],
      data_files=[],
      options={'py2app': OPTIONS})
```

How to scrape this webpage with Python and lxml? Empty list returned

Submitted by 十年热恋 on 2019-12-04 12:11:59
Question: For educational purposes, I'm trying to scrape this page gradually with Python and lxml, starting with movie names. From what I've read so far in the Python docs on lxml and the W3Schools pages on XPath, this code should yield all the movie titles in a list:

```python
from lxml import html
import requests

page = requests.get('http://www.rottentomatoes.com/browse/dvd-top-rentals/')
tree = html.fromstring(page.text)
movies = tree.xpath('//h3[@class="movieTitle"]/text()')
print movies
```

Basically, it should give me a list of the movie titles, but it returns an empty list.
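Besides JavaScript-rendered markup, one common cause of an empty list here is that `@class="movieTitle"` matches only the *exact* attribute value: if the real element carries extra classes, the predicate fails. The standard XPath 1.0 workaround is the `contains`/`concat` idiom; a sketch with hypothetical markup:

```python
import lxml.html

# Hypothetical markup: the h3 carries a second class alongside movieTitle
doc = lxml.html.fromstring('<h3 class="movieTitle bold">Example Film</h3>')

print(doc.xpath('//h3[@class="movieTitle"]/text()'))
# [] -- exact-match predicate misses "movieTitle bold"

print(doc.xpath('//h3[contains(concat(" ", normalize-space(@class), " "),'
                ' " movieTitle ")]/text()'))
# ['Example Film'] -- token-aware match succeeds
```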

Get all the links in an HTML page using lxml

Submitted by 丶灬走出姿态 on 2019-12-04 12:10:01
I want to find all the URLs and their names in an HTML page using lxml. I can parse the page and dig these out myself, but is there an easy way to find all the URL links using lxml?

One way:

```python
from lxml.html import parse

dom = parse('http://www.google.com/').getroot()
links = dom.cssselect('a')
```

Or, reading from a local file:

```python
from lxml import cssselect, html

with open("/your/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)
select = cssselect.CSSSelector("a")
links = [el.get('href') for el in select(dochtml)]
for n, l in enumerate(links):
    print n, l
```
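lxml.html also ships a dedicated helper for this, `iterlinks()`, which yields `(element, attribute, link, pos)` tuples for every link-bearing attribute in the document (not just `<a href>`, so filter as needed). A minimal sketch that pairs each anchor's text with its URL:

```python
import lxml.html

doc = lxml.html.fromstring(
    '<p><a href="/a">first</a> and <a href="/b">second</a></p>')

# Keep only anchor href links, pairing the link text with the URL
pairs = [(el.text_content(), link)
         for el, attr, link, pos in doc.iterlinks()
         if el.tag == 'a' and attr == 'href']
print(pairs)
# [('first', '/a'), ('second', '/b')]
```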

Detailed guide to basic usage of the Beautifulsoup module

Submitted by 爷，独闯天下 on 2019-12-04 12:05:50
Contents: the Beautifulsoup module; official Chinese documentation; introduction; basic usage; traversing the document tree; searching the document tree; the five kinds of filters; find_all(name, attrs, recursive, text, **kwargs); find(name, attrs, recursive, text, **kwargs); other methods; CSS selectors; summary

The Beautifulsoup module

Official Chinese documentation: Beautifulsoup official Chinese documentation

Introduction: Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with your parser of choice, it provides idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer under development; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

```shell
# install Beautiful Soup
pip install beautifulsoup4

# install a parser: Beautiful Soup supports the HTML parser in the
# Python standard library as well as several third-party parsers, one
# of which is lxml. Depending on your operating system, lxml can be
# installed with, for example:
apt-get install python-lxml
```
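A minimal basic-usage sketch (assumes `beautifulsoup4` is installed; it uses the stdlib `html.parser` backend so no third-party parser is required):

```python
from bs4 import BeautifulSoup

# Parse a small fragment with the standard-library parser
soup = BeautifulSoup('<p class="title"><b>Hello</b> world</p>', 'html.parser')

# find() locates the first matching tag; get_text() strips the markup
tag = soup.find('p', class_='title')
print(tag.get_text())
# Hello world
```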

Iteratively parse a large XML file without using the DOM approach

Submitted by 霸气de小男生 on 2019-12-04 11:50:45
I have an XML file:

```xml
<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  ...
  <email id="998349883487454359203" Body="hi"/>
</temp>
```

I want to read the file one email tag at a time: read email id=1 and extract its Body, then read email id=2 and extract its Body, and so on. I tried to do this using the DOM model for XML parsing, but since my file size is 100 GB the approach does not work. I then tried:

```python
from xml.etree import ElementTree as ET

tree = ET.parse('myfile.xml')
root = tree.getroot()
for i in root.findall('email'):
    print i.get('Body')
```
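`ET.parse` still builds the whole tree in memory, which is what fails at 100 GB. The usual streaming answer is `iterparse`, which visits each `<email>` as its end tag is read; clearing the element afterwards keeps memory flat. A minimal sketch on an in-memory stand-in for the file:

```python
import io
from xml.etree.ElementTree import iterparse

# Stand-in for open('myfile.xml', 'rb'); iterparse accepts any file-like object
xml = io.BytesIO(b'<temp><email id="1" Body="abc"/><email id="2" Body="fre"/></temp>')

bodies = []
for event, elem in iterparse(xml, events=('end',)):
    if elem.tag == 'email':
        bodies.append(elem.get('Body'))
        elem.clear()   # drop the processed element so the tree stays small
print(bodies)
# ['abc', 'fre']
```

lxml provides a compatible `lxml.etree.iterparse` with the same pattern, plus a `tag='email'` filter argument.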