lxml

The correct way to install lxml on Ubuntu/Debian

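夙愿已清 submitted on 2020-01-16 01:21:20
lxml is a Python library used mainly for processing XML and HTML. I needed lxml recently, but installing it directly with pip failed on Ubuntu. After some digging I found an installation method that works, and I am recording it here. Ubuntu and Debian install software the same way, so this applies to both systems. Install the libxml2, libxslt and Python development packages first, then run pip:
$ sudo apt-get install libxml2-dev libxslt-dev python2.7-dev
$ sudo pip install lxml
Replace 2.7 with your Python version. This installs the latest version of lxml. There is also a simpler method, although it may install an older version; it is listed here for reference:
$ sudo apt-get install python-lxml
Source: https://www.cnblogs.com/numbbbbb/p/3434519.html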
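
To confirm that the installation worked, a quick check along the following lines may help. It is only a sketch: it assumes the import succeeds, and the sample document is made up for illustration.

```python
from lxml import etree

# Version tuples exposed by lxml itself and by the libxml2 it was built against.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
libxml_version = ".".join(str(part) for part in etree.LIBXML_VERSION)
print("lxml", lxml_version, "/ libxml2", libxml_version)

# Parse a tiny made-up document to check that the C extension really works.
root = etree.fromstring("<a><b>ok</b></a>")
print(root.findtext("b"))
```

If the import itself raises an error, the build most likely did not find the development headers installed in the first step.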

Define default namespace (unprefixed) in lxml

限于喜欢 submitted on 2020-01-15 07:01:27
Question: When rendering XHTML with lxml, everything is fine unless you happen to use Firefox, which seems unable to deal with namespace-prefixed XHTML elements and JavaScript. Opera executes the JavaScript fine (this applies to both jQuery and MathJax) whether or not the XHTML namespace has a prefix (h: in my case), but in Firefox the scripts abort with weird errors (this.head is undefined in the case of MathJax). I know about the register_namespace function, but it does
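
One common way to get unprefixed XHTML elements is to map the namespace to the key None in nsmap when the root element is created. The sketch below illustrates that idea only; it is not necessarily what the answers to this question recommend, and the tiny document is invented for the example.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Mapping None to the namespace URI makes it the default (unprefixed) namespace.
root = etree.Element("{%s}html" % XHTML_NS, nsmap={None: XHTML_NS})
body = etree.SubElement(root, "{%s}body" % XHTML_NS)
body.text = "Hello"

xml = etree.tostring(root).decode("utf-8")
print(xml)
```

Child elements created in the same namespace inherit the default declaration from the root, so no h: prefix appears in the serialized output.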

Parsing large XML using iterparse() consumes too much memory. Any alternative?

ぃ、小莉子 submitted on 2020-01-15 06:16:12
Question: I am using Python 2.7 with the latest lxml library. I am parsing a large XML file with a very homogeneous structure and millions of elements. I thought lxml's iterparse would not build an internal tree while it parses, but apparently it does, since memory usage grows until it crashes (around 1 GB). Is there a way to parse a large XML file using lxml without using a lot of memory? I saw the target parser interface as one possibility, but I'm not sure if that will work any better. Answer 1: Try using Liza Daly
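
A widely used pattern for keeping iterparse memory flat is to clear each element once it has been processed and to delete the already-handled siblings that the parent still references. The sketch below shows that pattern under stated assumptions: the helper name iter_records, the tag name item and the generated input are all hypothetical, and the truncated answer above may describe something different.

```python
import io
from lxml import etree


def iter_records(source, tag):
    """Yield the text of each <tag> element, releasing memory along the way."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem.text
        # Free the element's own content once the caller is done with it.
        elem.clear()
        # Drop earlier siblings that the parent element still holds on to.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


# Hypothetical input: a homogeneous document with 1000 <item> elements.
payload = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"
values = list(iter_records(io.BytesIO(payload), "item"))
print(len(values), values[0], values[-1])
```

For a real file, pass the file path or an open binary file object instead of the BytesIO buffer.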

Python, lxml - access text

天涯浪子 submitted on 2020-01-14 16:35:20
Question: I'm currently a bit out of ideas, and I really hope that you can give me a hint. It's probably best to explain my question with a small piece of sample code:
from lxml import etree
from io import StringIO
testStr = "<b>text0<i>text1</i><ul><li>item1</li><li>item2</li></ul>text2<b/><b>sib</b>"
parser = etree.HTMLParser()
# generate html tree
htmlTree = etree.parse(StringIO(testStr), parser)
print(etree.tostring(htmlTree, pretty_print=True).decode("utf-8"))
bElem = htmlTree.getroot().find("body
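
In lxml, text in mixed content is split between an element's .text (the text before its first child) and each child's .tail (the text after that child's closing tag). The following sketch shows this on a simplified, well-formed fragment rather than the asker's exact markup; the fragment is an assumption made for illustration.

```python
from lxml import etree

# Simplified, well-formed stand-in for the markup in the question.
b_elem = etree.fromstring("<b>text0<i>text1</i>text2</b>")
i_elem = b_elem[0]

leading = b_elem.text                 # text before the first child
inner = i_elem.text                   # text inside <i>
trailing = i_elem.tail                # text after </i>, still inside <b>
everything = "".join(b_elem.itertext())  # all text in document order

print(leading, inner, trailing, everything)
```

Note that text2 belongs to the tail of the i element, not to the text of b, which is often the source of confusion.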

value attribute for lxml.html

二次信任 submitted on 2020-01-14 13:26:31
Question: Here is my code:
from lxml.html import fromstring
#code
print fromstring(s).xpath('/html/body/div[3]/div/div[2]/div/form/input[4]')
Output is [<InputElement 2946d20 name='question' type='hidden'>] How can I output the value? Is there an attribute for this? Thank you.
Answer 1: With lxml.html, form elements such as input expose their value directly via the .value attribute:
>>> from lxml.html import fromstring
>>> s = """<input type="hidden" name="question" value="1234">"""
>>> doc = fromstring(s)
>>> doc.value
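
As a self-contained sketch of the same idea, the snippet below reads the value both through the .value property of the lxml.html input element and through the generic .get() attribute lookup. The form markup is made up for the example.

```python
from lxml.html import fromstring

# Made-up form containing a single hidden input.
doc = fromstring('<form><input type="hidden" name="question" value="1234"></form>')
field = doc.xpath('//input[@name="question"]')[0]

via_property = field.value        # property provided by lxml.html input elements
via_get = field.get("value")      # plain attribute lookup, works on any element
print(via_property, via_get)
```

The .get() form also works for elements parsed with lxml.etree, which has no .value property.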

Override lxml behavior to write a closing and opening element for Null tags

本秂侑毒 submitted on 2020-01-14 13:07:32
Question:
root = etree.Element('document')
rootTree = etree.ElementTree(root)
firstChild = etree.SubElement(root, 'test')
The output is: <document> <test/> </document> I want the output to be: <document> <test> </test> </document> I know both are equivalent, but is there a way to get the output that I want? Answer 1: Set the method argument of tostring to html, as in: etree.tostring(root, method="html") Reference: Close a tag with no text in lxml Answer 2: Here is how you can do it: from lxml import etree root =

Installing the lxml library on Python 3.6

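a 夏天 submitted on 2020-01-14 10:27:07
It seems that since Python 3.5, etree cannot be used even after lxml is installed. To work around this, use the following method:
1. Download the lxml wheel file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
2. Put the downloaded wheel file in the directory where you will run pip.
3. Install the wheel file with pip: pip install lxml-4.4.2-cp36-cp36m-win_amd64.whl
4. Verify the installation: open cmd, start the Python interpreter, and enter from lxml import etree. No error means the installation succeeded.
PS. If PyCharm reports Unresolved reference 'etree' for from lxml import etree but the code still runs, the fix below should help, and it probably applies to other unresolved reference warnings as well:
1. From the menu bar, open File -> Settings -> Build, Execution, Deployment -> Console -> Python Console.
2. Check Add source roots to PYTHONPATH.
3. Click Apply.
Source: https://www.cnblogs.com/wulixia/p/12190373.html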

Learning Python's lxml library

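删除回忆录丶 submitted on 2020-01-13 10:08:20
lxml is a Python library that makes it easy to process XML and HTML files, and it can also be used for web scraping. Plenty of ready-made XML parsers exist, but for better results developers sometimes prefer to write their own XML and HTML parsers, and this is where lxml comes in handy. Its main advantages are that it is easy to use, very fast when parsing large documents, well documented, and offers simple ways to convert data to Python data types, which makes file handling easier.
Installation
Installing from a mirror in China works fine. Open a command window (Win+R, then type cmd) and install with pip: pip install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com This command fetches the package from the Douban mirror. You can also install directly: pip install lxml Installing directly is a little slower unless you change the default package index (search for how to do that yourself). You now have a copy of the lxml library installed on your local machine, so let's get hands-on and see what can be done with it.
Features
To use lxml in a program, you first need to import it. You can use the following command: from lxml import etree This imports the etree module we are interested in from the lxml library.
Creating HTML/XML documents
With the etree module we can create XML/HTML elements and their subelements, which is very useful when we are trying to write or manipulate HTML or XML files.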
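
As a minimal sketch of what creating elements with etree can look like, the example below builds a small HTML tree with Element and SubElement and serializes it. The tag names, text and attribute values are invented for illustration and are not taken from the original article.

```python
from lxml import etree

# Build the tree top-down: a root element, then nested children.
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "Demo"

body = etree.SubElement(root, "body")
paragraph = etree.SubElement(body, "p", {"class": "intro"})
paragraph.text = "Hello"

# Compact form for comparisons, pretty-printed form for reading.
compact = etree.tostring(root)
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
```

SubElement both creates the child and appends it to the parent, so no separate append call is needed.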

How to parse HTML table against a list of variables using lxml?

你。 submitted on 2020-01-13 07:03:09
Question: I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches the results, I want to extract a column's contents only when the cell starts with a variable from my config file. For instance, if a <td> starts with 'Street 1', I then want to grab the <span> contents of that <td> tag. This way, I can build a tuple of tuples (which takes care of the None values) that I can then store in the database. lxml_parse.py import lxml.html as lh doc
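
One possible way to approach this is to walk the td cells, compare each cell's leading text against the configured labels, and read the span inside the matching cell. The sketch below is built on assumptions: the LABELS list, the extract helper and the sample table are hypothetical stand-ins for the asker's config file and page.

```python
import lxml.html as lh

# Hypothetical labels that would normally come from the config file.
LABELS = ["Street 1", "City", "Zip"]

# Hypothetical table; the last row has a label but no <span> value.
HTML = """<table>
<tr><td>Street 1: <span class="boldred">12 Main St</span></td></tr>
<tr><td>City: <span class="boldred">Springfield</span></td></tr>
<tr><td>Zip: </td></tr>
</table>"""


def extract(doc, labels):
    """Return a tuple of (label, value) pairs; value is None when missing."""
    found = {}
    for td in doc.xpath("//td"):
        leading_text = (td.text or "").strip()
        for label in labels:
            if leading_text.startswith(label):
                found[label] = td.findtext("span[@class='boldred']")
    return tuple((label, found.get(label)) for label in labels)


rows = extract(lh.fromstring(HTML), LABELS)
print(rows)
```

Keeping the labels in order and filling gaps with None gives a fixed-width tuple that maps directly onto database columns.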

Python lxml parsing svg file

六眼飞鱼酱① submitted on 2020-01-13 03:05:02
Question: I'm trying to parse .svg files from http://kanjivg.tagaini.net/ , but I can't successfully extract the information inside. Edit 1: (full file) http://www.filedropper.com/0f9ab A part of 0f9ab.svg looks like this: <svg xmlns="http://www.w3.org/2000/svg" width="109" height="109" viewBox="0 0 109 109"> <g id="kvg:StrokePaths_0f9ab" style="fill:none;stroke:#000000;stroke-width:3;stroke-linecap:round;stroke-linejoin:round;"> <g id="kvg:0f9ab" kvg:element="嶺"> <g id="kvg:0f9ab-g1" kvg:element="山"
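
Lookups in SVG files often fail because every element lives in the SVG namespace, so unqualified names such as g or path match nothing. The sketch below shows namespace-aware queries on a cut-down, made-up sample. It assumes the kvg prefix is declared on the root element and bound to http://kanjivg.tagaini.net; the real files may declare it differently, and the path data here is a placeholder.

```python
from lxml import etree

# Assumed namespace bindings; the kvg URI is an assumption for this sketch.
NS = {
    "svg": "http://www.w3.org/2000/svg",
    "kvg": "http://kanjivg.tagaini.net",
}

# Cut-down stand-in for a KanjiVG file, with a placeholder path.
SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:kvg="http://kanjivg.tagaini.net" width="109" height="109">'
    '<g id="kvg:StrokePaths_0f9ab">'
    '<g id="kvg:0f9ab" kvg:element="嶺">'
    '<path id="kvg:0f9ab-s1" d="M0,0"/>'
    "</g></g></svg>"
)

root = etree.fromstring(SAMPLE.encode("utf-8"))

# Qualify element and attribute names with prefixes mapped in NS.
groups = root.xpath("//svg:g[@kvg:element]", namespaces=NS)
elements = [g.get("{%s}element" % NS["kvg"]) for g in groups]
paths = root.findall(".//svg:path", namespaces=NS)

print(elements, [p.get("id") for p in paths])
```

The prefixes used in the query only need to match the keys of NS, not the prefixes used inside the file.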