lxml

How to use lxml to grab specific parts of an XML document?

房东的猫 submitted on 2019-12-08 06:58:37
Question: I am using Amazon's API to receive information about books. I am trying to use lxml to extract the specific parts of the XML document that my application needs. I am not really sure how to use lxml, though. This is as far as I have gotten: root = etree.XML(response), which creates an etree object for the XML document. Here is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Items", but I only pasted one of them to give you a specific example. For each
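The excerpt stops before the document structure is shown, so the following is only a rough sketch of the general approach. The element names (Item, ASIN, Title) and the namespace URI are hypothetical stand-ins, not the real API response. It shows findall() and findtext() with a prefix-to-namespace mapping, which is usually needed when the response declares a default namespace.

```python
from lxml import etree

# Hypothetical stand-in for the API response. The real element names and
# namespace URI are not visible in the excerpt, so these are assumptions.
response = b"""<ItemLookupResponse xmlns="http://example.com/ns">
  <Items>
    <Item><ASIN>0000000001</ASIN><Title>First Book</Title></Item>
    <Item><ASIN>0000000002</ASIN><Title>Second Book</Title></Item>
  </Items>
</ItemLookupResponse>"""

root = etree.XML(response)

# Map an arbitrary prefix to the document's default namespace.
ns = {"a": "http://example.com/ns"}


def extract_items(root, ns):
    """Collect one dict per Item element found anywhere below root."""
    items = []
    for item in root.findall(".//a:Item", namespaces=ns):
        items.append({
            "asin": item.findtext("a:ASIN", namespaces=ns),
            "title": item.findtext("a:Title", namespaces=ns),
        })
    return items


books = extract_items(root, ns)
```

If the real document has no namespace, the prefix and the namespaces argument can simply be dropped.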

PYTHON : How to add root node to an XML

淺唱寂寞╮ submitted on 2019-12-08 06:36:01
Question: I have an XML file that looks something like this: <A> <B> <C> .... </C> </B> </A> I want to add a root on top of element 'A'. I found a way to add elements to the root, but how do I change the existing root and add a new one on top of it using Python? After adding the root, the XML should look like this: <ROOT> <A> <B> <C> .... </C> </B> </A> </ROOT> Answer 1: import lxml.etree as ET tree = ET.parse('data') root = tree.getroot() newroot = ET.Element("root") newroot.insert(0, root) print(ET.tostring(newroot, pretty_print
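A minimal, self-contained sketch of the same idea, assuming the document is available as a byte string rather than the file named 'data' used above. The new element is created first and the old root is then moved under it.

```python
from lxml import etree

# Sample input standing in for the file contents (an assumption).
xml = b"<A><B><C>text</C></B></A>"

old_root = etree.fromstring(xml)

# Create the new top-level element and move the old root beneath it.
new_root = etree.Element("ROOT")
new_root.append(old_root)

result = etree.tostring(new_root)
```

Note that tag names are case-sensitive, so the element must be created as "ROOT" to match the desired output.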

Unable to Install with easy_install or pip on mac

核能气质少年 submitted on 2019-12-08 06:22:41
Question: I'm trying to install the lxml and pycrypto modules using easy_install (and pip), but I get error messages like: Running lxml-2.3.4/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kGsWMh/lxml-2.3.4/egg-dist-tmp-Gjqy3f Building lxml version 2.3.4. Building without Cython. Using build configuration of libxslt 1.1.24 In file included from /usr/include/limits.h:63, from /Developer/usr/bin/../lib/gcc/powerpc-apple-darwin10/4.0.1/include/limits.h:10, from /Library/Frameworks/Python.framework

lxml: get element with a particular child element?

大城市里の小女人 submitted on 2019-12-08 06:17:08
Question: Working in lxml, I want to get the href attribute of all links with an img child that has title="Go to next page". So in the following snippet: <a class="noborder" href="StdResults.aspx"> <img src="arrowr.gif" title="Go to next page"></img> </a> I'd like to get StdResults.aspx back. I've got this far: next_link = doc.xpath("//a/img[@title='Go to next page']") print next_link[0].attrib['href'] But next_link is the img, not the a tag. How can I get the a tag? Thanks. Answer 1: Just change a/img..
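The answer is cut off, so here is a sketch of two common ways to do this, not necessarily the one the answer goes on to give. The first moves the img test into a predicate on the a element so that the a is what gets selected. The second keeps the original query and walks up with getparent(). The second link in the sample markup is made up, to show that non-matching links are skipped.

```python
from lxml import html

# Sample markup based on the snippet in the question; the second link is a
# made-up extra that should not match.
snippet = """<div>
<a class="noborder" href="StdResults.aspx"><img src="arrowr.gif" title="Go to next page"></a>
<a href="Other.aspx"><img src="x.gif" title="Something else"></a>
</div>"""

doc = html.fromstring(snippet)

# Option 1: select the <a> that has a matching <img> child, then its href.
hrefs = doc.xpath("//a[img[@title='Go to next page']]/@href")

# Option 2: select the <img> as before and step up to its parent <a>.
via_parent = [
    img.getparent().get("href")
    for img in doc.xpath("//a/img[@title='Go to next page']")
]
```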

Python: Convert Raw String to Bytes String without adding escape characters

坚强是说给别人听的谎言 submitted on 2019-12-08 06:01:34
Question: I have a string: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' And I want: b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' But I keep getting: b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084' Context: I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it
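The excerpt ends before any answer, so this is a hedged sketch covering the two situations that usually lie behind this symptom. If each character of the string already stands for one byte (code points 0 to 255), encoding with latin-1 maps them back one to one. If the string instead contains literal backslash sequences such as the four characters \xaf, those need to be interpreted first. The sample data is generated with bz2 for illustration and is not the string from the question.

```python
import bz2

# Case 1: the text holds one character per byte (code points 0-255).
# Simulate that by decoding real bz2 output as latin-1.
original = bz2.compress(b"hello world")
as_text = original.decode("latin-1")

# latin-1 maps code points 0-255 straight back to the same byte values.
as_bytes = as_text.encode("latin-1")
decompressed = bz2.decompress(as_bytes)

# Case 2: the text holds literal backslash escapes (backslash, x, a, f ...).
# Interpret the escapes first, then encode as above.
escaped = r"\xaf\x82"
unescaped_bytes = (
    escaped.encode("ascii").decode("unicode_escape").encode("latin-1")
)
```

Which case applies depends on how the page was scraped, which the excerpt does not show.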

python lxml adds unused namespaces

我的未来我决定 submitted on 2019-12-08 05:47:35
Question: I'm having an issue when using lxml's find() method to select a node in an XML file. Essentially I am trying to move a node from one XML file to another. File 1: <somexml xmlns:a='...' xmlns:b='...' xmlns:c='...'> <somenode id='foo'> <something>bar</something> </somenode> </somexml> Once I parse File 1 and do a find on it: node = tree.find('//*[@id="foo"]') the node looks like this: <somenode xmlns:a='...' xmlns:b='...' xmlns:c='...'> <something>bar</something> </somenode> Notice it added the
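One possible approach, sketched under assumptions: the namespace URIs below are placeholders for the elided '...' values, and the excerpt does not show which fix the original thread settled on. The idea is to copy the node out of its tree, so it no longer inherits declarations from its ancestors, and then call etree.cleanup_namespaces() to drop any declarations it does not use.

```python
import copy

from lxml import etree

# Placeholder namespace URIs; the question elides the real ones.
xml = (
    b'<somexml xmlns:a="http://example.com/a" xmlns:b="http://example.com/b" '
    b'xmlns:c="http://example.com/c">'
    b'<somenode id="foo"><something>bar</something></somenode>'
    b"</somexml>"
)

tree = etree.ElementTree(etree.fromstring(xml))
node = tree.find('.//*[@id="foo"]')

# Serialising the node in place shows the ancestors' declarations on it.
in_place = etree.tostring(node)

# Copy the node into its own document, then remove unused declarations.
detached = copy.deepcopy(node)
etree.cleanup_namespaces(detached)
serialized = etree.tostring(detached)
```

The detached copy can then be appended to the second document without carrying the unused prefixes along.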

using fromstring() with lxml prefixes

一笑奈何 submitted on 2019-12-08 05:35:58
Question: I have a variable ele. I'm trying to append a child node to ele whose tag contains a namespace prefix (called style). ele seems to be aware of this prefix, as the line print(ele.nsmap['style']) outputs urn:oasis:names:tc:opendocument:xmlns:style:1.0 But when I try to run ele.append(etree.fromstring('<style:style />')) I get the error lxml.etree.XMLSyntaxError: Namespace prefix style on style is not defined What am I missing here? Answer 1: etree.fromstring('<style:style />') throws an
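The answer is truncated, so the following is a sketch of two workarounds rather than a reconstruction of it. fromstring() parses its argument as a standalone document, so the prefix has to be declared inside that string. Alternatively, the child can be created directly with the namespace URI in Clark notation ({uri}tag). The parent element built here is a hypothetical stand-in for ele.

```python
from lxml import etree

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# Hypothetical stand-in for the question's ele, with the prefix declared.
ele = etree.Element("{%s}styles" % STYLE_NS, nsmap={"style": STYLE_NS})

# Option 1: build the child directly from the namespace URI.
child = etree.SubElement(ele, "{%s}style" % ele.nsmap["style"])

# Option 2: parse a fragment that declares the prefix itself.
parsed = etree.fromstring('<style:style xmlns:style="%s"/>' % STYLE_NS)
ele.append(parsed)
```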

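Python Web Crawlers (Part 1)

帅比萌擦擦* submitted on 2019-12-08 05:02:36
Python Web Crawlers (Part 1) Overview Prerequisites 1. How to handle pages that contain large amounts of JavaScript (JS), and how to handle logins. 2. Terminology: screen scraping, data mining, web harvesting, web scraping, web crawler, bot. 3. Advantages of web crawlers: (a) they can process thousands or even millions of pages at once; (b) unlike traditional search engines, they can obtain more precise data; (c) compared with fetching data through an API, they are more flexible. 4. Applications of web crawlers: data acquisition and analysis (classification and aggregation) in market forecasting, machine translation, medical diagnosis, news sites, articles, health forums, macroeconomics, genetics, international relations, the arts, and other fields. 5. Web crawling touches on: databases, web servers, the HTTP protocol, HTML (HyperText Markup Language), network security, image processing, data science, and related areas. 6. Components of a web page: the HTML text-formatting layer, the CSS style layer (Cascading Style Sheets), the JavaScript execution layer, and the image rendering layer. 7. JavaScript design ideas: (1) it borrows its basic syntax from C; (2) it borrows data types and memory management from Java; (3) it borrows from Scheme, promoting functions to "first-class" status; (4) it borrows from the Self language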

How to remove namespace value from inside lxml.html.html5parser element tag

狂风中的少年 submitted on 2019-12-08 04:32:09
Question: Is it possible not to add a namespace to the tag when using html5parser from the lxml.html package? Example: from lxml import html print(html.parse('http://example.com').getroot().tag) # You will get 'html' from lxml.html import html5parser print(html5parser.parse('http://example.com').getroot().tag) # You will get '{http://www.w3.org/1999/xhtml}html' The easiest solution I found is to remove it with a regex, but maybe it's possible not to include that text at all? Answer 1: There is a specific

Trying to install lxml on Mac OS X Leopard

若如初见. submitted on 2019-12-08 04:09:06
Question: I have tried lots of different guides; this one gets me the furthest: CFLAGS="$CFLAGS -lgcrypt -fPIC" STATIC_DEPS=true easy_install-2.6 lxml However, after installing all dependencies I get this error message over and over again: install-NRDNAB/lxml-2.3/build/tmp/libxml2/lib/pkgconfig" /usr/bin/install -c -m 644 libxslt.m4 '/private/tmp/easy_install-NRDNAB/lxml-2.3/build/tmp/libxml2/share/aclocal' /usr/bin/install -c -m 644 xsltConf.sh '/private/tmp/easy_install-NRDNAB/lxml-2.3/build/tmp