lxml | 易学教程

How to use lxml to grab specific parts of an XML document?

Question: I am using Amazon's API to receive information about books. I am trying to use lxml to extract the specific parts of the XML document that my application needs, but I am not really sure how to use lxml. This is as far as I have gotten: root = etree.XML(response) This creates an etree object for the XML document. Here is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Items", but I only pasted one of them to give you a specific example. For each
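
A minimal sketch of one way to pull fields out of each item once the response has been parsed. The sample document and its element names (Item, Title, Author) are invented for illustration, since the real response is only available at the pastebin link. If the real document declares a default namespace, each tag in the lookups would need to be written in the {namespace-uri}tag form.

```python
from lxml import etree

# Invented stand-in for the API response; the element names are assumptions.
response = b"""<Items>
  <Item><Title>First Book</Title><Author>A. Writer</Author></Item>
  <Item><Title>Second Book</Title><Author>B. Writer</Author></Item>
</Items>"""


def extract_books(xml_bytes):
    """Return one dict per <Item> holding the fields the application needs."""
    root = etree.XML(xml_bytes)
    books = []
    for item in root.iter("Item"):
        books.append({
            # findtext returns None when the child element is missing
            "title": item.findtext("Title"),
            "author": item.findtext("Author"),
        })
    return books


books = extract_books(response)
print(books)
```

Looping with root.iter() keeps the code independent of how deeply the items are nested, which may help if the real response wraps them in extra container elements.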

PYTHON : How to add root node to an XML

Question: I have an XML file that looks something like this: <A> <B> <C> .... </C> </B> </A> I want to add a new root on top of element 'A'. I found a way to add elements to the root, but how do I change the existing root and add a new one on top of it using Python? After adding the root, the XML should look like this: <ROOT> <A> <B> <C> .... </C> </B> </A> </ROOT> Answer 1: import lxml.etree as ET tree = ET.parse('data') root = tree.getroot() newroot = ET.Element("root") newroot.insert(0, root) print(ET.tostring(newroot, pretty_print
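
The approach in the answer can be sketched as a self-contained example. This version parses from a string rather than the 'data' file, and the helper name wrap_in_new_root is hypothetical; it assumes lxml is installed.

```python
from lxml import etree


def wrap_in_new_root(old_root, new_tag="ROOT"):
    """Create a new root element and move the old root underneath it."""
    new_root = etree.Element(new_tag)
    # append() moves the element into the new tree; it does not copy it
    new_root.append(old_root)
    return new_root


old_root = etree.fromstring("<A><B><C>text</C></B></A>")
new_root = wrap_in_new_root(old_root)
print(etree.tostring(new_root, pretty_print=True).decode())
```

Because the old root is moved rather than copied, any existing references to it stay valid and now point at a child of the new root.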

Unable to Install with easy_install or pip on mac

Question: I'm trying to install the lxml and pycrypto modules using easy_install (and pip), but I get error messages like: Running lxml-2.3.4/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kGsWMh/lxml-2.3.4/egg-dist-tmp-Gjqy3f Building lxml version 2.3.4. Building without Cython. Using build configuration of libxslt 1.1.24 In file included from /usr/include/limits.h:63, from /Developer/usr/bin/../lib/gcc/powerpc-apple-darwin10/4.0.1/include/limits.h:10, from /Library/Frameworks/Python.framework

lxml: get element with a particular child element?

Question: Working in lxml, I want to get the href attribute of all links with an img child that has title="Go to next page". So in the following snippet: <a class="noborder" href="StdResults.aspx"> <img src="arrowr.gif" title="Go to next page"></img> </a> I'd like to get StdResults.aspx back. I've got this far: next_link = doc.xpath("//a/img[@title='Go to next page']") print next_link[0].attrib['href'] But next_link is the img, not the a tag. How can I get the a tag? Thanks. Answer 1: Just change a/img..
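
Since the answer is cut off, here is a sketch of two XPath expressions that would return the href of the enclosing link. The surrounding div and the second link are added only to make the example self-contained; this assumes lxml is installed.

```python
from lxml import html

snippet = """
<div>
  <a class="noborder" href="StdResults.aspx">
    <img src="arrowr.gif" title="Go to next page"></img>
  </a>
  <a href="Other.aspx">plain link</a>
</div>
"""
doc = html.fromstring(snippet)

# Put the img test inside a predicate so that <a> stays the selected node.
hrefs = doc.xpath("//a[img[@title='Go to next page']]/@href")

# Alternative: select the img as before, then step up to its parent with "..".
hrefs_via_parent = doc.xpath("//a/img[@title='Go to next page']/../@href")

print(hrefs, hrefs_via_parent)
```

Both expressions return a list, so an empty result is worth checking for before indexing into it.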

Python: Convert Raw String to Bytes String without adding escape characters

Question: I have a string: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' And I want: b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' But I keep getting: b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084' Context: I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it
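
A sketch of two possible conversions, depending on what the scraped string really contains. The helper names are hypothetical, and the excerpt does not say which case applies: the doubled backslashes in the unwanted output suggest the second, but that is an assumption.

```python
import bz2


def text_to_bytes(text):
    """Case 1: every character already stands for one byte (code points 0-255)."""
    return text.encode("latin-1")


def escaped_text_to_bytes(text):
    """Case 2: the text holds literal escapes, e.g. the four characters \\xaf."""
    # Turn the escape sequences into characters, then map each one to a byte.
    return text.encode("latin-1").decode("unicode_escape").encode("latin-1")


# Case 1 round trip: compressed bytes that were mis-decoded into a str.
original = bz2.compress(b"hello lxml")
as_text = original.decode("latin-1")
restored = text_to_bytes(as_text)
print(bz2.decompress(restored))

# Case 2: a raw string containing backslash sequences as plain text.
escaped = r"BZh\xaf\x82\r\x00"
print(escaped_text_to_bytes(escaped))
```

Both helpers raise UnicodeEncodeError if the text contains a character above code point 255, which is a useful signal that the data was not byte-like to begin with.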

python lxml adds unused namespaces

Question: I'm having an issue when using lxml's find() method to select a node in an XML file. Essentially, I am trying to move a node from one XML file to another. File 1: <somexml xmlns:a='...' xmlns:b='...' xmlns:c='...'> <somenode id='foo'> <something>bar</something> </somenode> </somexml> Once I parse File 1 and do a find on it: node = tree.find('//*[@id="foo"]') the node looks like this: <somenode xmlns:a='...' xmlns:b='...' xmlns:c='...'> <something>bar</something> </somenode> Notice it added the
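
One possible way to get a copy of the node without the inherited declarations is to detach a copy from its tree and then call etree.cleanup_namespaces() on it. This is a sketch, not necessarily what the original answer proposed. The namespace URIs below are placeholders, because the question elides them, and the path is written as './/*' rather than '//*' since lxml's find() expects a relative path.

```python
import copy

from lxml import etree

# File 1 from the question; the urn:example URIs are placeholder assumptions.
file1 = b"""<somexml xmlns:a="urn:example:a" xmlns:b="urn:example:b" xmlns:c="urn:example:c">
  <somenode id="foo"><something>bar</something></somenode>
</somexml>"""

tree = etree.ElementTree(etree.fromstring(file1))
node = tree.find('.//*[@id="foo"]')

# Work on a detached copy so the source tree is left untouched.
detached = copy.deepcopy(node)
# Drop any namespace declarations that nothing in the copy actually uses.
etree.cleanup_namespaces(detached)

serialized = etree.tostring(detached)
print(serialized.decode())
```

The cleaned copy can then be appended to the target document in place of the original node.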

using fromstring() with lxml prefixes

Question: I have a variable ele. I'm trying to append to ele a child node whose tag contains a namespace prefix (called style). ele seems to be aware of this prefix, as the line: print(ele.nsmap['style']) outputs urn:oasis:names:tc:opendocument:xmlns:style:1.0 But when I try to run ele.append(etree.fromstring('<style:style />')) I get the error lxml.etree.XMLSyntaxError: Namespace prefix style on style is not defined What am I missing here? Answer 1: etree.fromstring('<style:style />') throws an
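
The error arises because fromstring() parses its argument as a standalone document and knows nothing about the nsmap of ele. A sketch of two ways around that follows. The parent element built here is a hypothetical stand-in for the question's ele, whose real tag is not shown.

```python
from lxml import etree

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# Hypothetical stand-in for the question's `ele`; its real tag is unknown.
ele = etree.Element("{%s}styles" % STYLE_NS, nsmap={"style": STYLE_NS})

# Option 1: declare the prefix inside the fragment so it parses on its own.
fragment = '<style:style xmlns:style="%s"/>' % ele.nsmap["style"]
first = etree.fromstring(fragment)
ele.append(first)

# Option 2: skip string parsing and name the tag in {uri}localname form.
second = etree.SubElement(ele, "{%s}style" % STYLE_NS)

print(etree.tostring(ele, pretty_print=True).decode())
```

Option 2 avoids building XML by string formatting, which tends to be less error-prone when attribute values or text need escaping.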

Python Web Crawlers (Part 1)

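Python Web Crawlers (Part 1): Overview and Prerequisites
1. How to handle pages that contain a lot of JavaScript (JS), and how to handle logins.
2. Related terms: screen scraping, data mining, web harvesting, web scraping, web crawler, bot.
3. Advantages of web crawlers: (a) they can process thousands or even millions of pages at once; (b) unlike a traditional search engine, they can retrieve more precise data; (c) compared with fetching data through an API, they are more flexible.
4. Web crawlers are used to collect and analyze (classify and aggregate) data in areas such as market forecasting, machine translation, medical diagnosis, news sites, articles, health forums, macroeconomics, genomics, international relations, and the arts.
5. Web crawling draws on knowledge of databases, web servers, the HTTP protocol, HTML (HyperText Markup Language), network security, image processing, data science, and related fields.
6. The components of a web page: the HTML text-format layer, the CSS style layer (Cascading Style Sheets), the JavaScript execution layer, and the image rendering layer.
7. The ideas behind JavaScript: (1) it borrows its basic syntax from C; (2) it borrows its data types and memory management from Java; (3) it borrows from Scheme in promoting functions to "first-class" status; (4) it borrows from Self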

How to remove namespace value from inside lxml.html.html5parser element tag

Question: Is it possible to avoid adding a namespace to the tag when using html5parser from the lxml.html package? Example: from lxml import html print(html.parse('http://example.com').getroot().tag) # You will get 'html' from lxml.html import html5parser print(html5parser.parse('http://example.com').getroot().tag) # You will get '{http://www.w3.org/1999/xhtml}html' The easiest solution I found is to remove it with a regex, but maybe it's possible not to include that text at all? Answer 1: There is a specific
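
The answer is cut off, so the following is only a sketch of one option that may be what it refers to: html5lib's namespaceHTMLElements parser setting, which lxml's html5parser.HTMLParser passes through as a keyword argument. It assumes both lxml and html5lib are installed, and it parses an inline string instead of fetching http://example.com.

```python
from lxml.html import html5parser

# namespaceHTMLElements is an html5lib parser option; False asks for plain tags.
parser = html5parser.HTMLParser(namespaceHTMLElements=False)

root = html5parser.document_fromstring(
    "<html><head><title>t</title></head><body><p>hi</p></body></html>",
    parser=parser,
)
print(root.tag)
```

The same parser object can be passed to html5parser.parse() when reading from a file or URL.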

Trying to install lxml on Mac OS X Leopard

Question: I have tried lots of different guides. This one gets me the furthest: CFLAGS="$CFLAGS -lgcrypt -fPIC" STATIC_DEPS=true easy_install-2.6 lxml However, after installing all the dependencies I get this error message over and over again: install-NRDNAB/lxml-2.3/build/tmp/libxml2/lib/pkgconfig" /usr/bin/install -c -m 644 libxslt.m4 '/private/tmp/easy_install-NRDNAB/lxml-2.3/build/tmp/libxml2/share/aclocal' /usr/bin/install -c -m 644 xsltConf.sh '/private/tmp/easy_install-NRDNAB/lxml-2.3/build/tmp