lxml

How to install lxml on Windows

本小妞迷上赌 submitted on 2019-12-17 19:11:48
Question: I'm trying to install lxml on my Windows 8.1 laptop with Python 3.4 and failing miserably. First off, I tried the simple and obvious solution: pip install lxml. However, this didn't work. Here's what it said:

Downloading/unpacking lxml
  Running setup.py (path:C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py) egg_info for package lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,\r
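The error means pip is compiling lxml from source and cannot find the libxml2/libxslt build tools. A minimal sketch of a common workaround, which skips the source build and installs a prebuilt binary wheel instead (the exact wheel filename below is hypothetical; pick the build matching your Python version and architecture):

pip install wheel
:: download the wheel matching your interpreter (e.g. cp34, win32 or win_amd64)
:: then install it directly:
pip install lxml-3.4.2-cp34-none-win32.whl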

python install lxml on mac os 10.10.1

你离开我真会死。 submitted on 2019-12-17 16:35:45
Question: I bought a new MacBook and I am new to macOS. However, I read a lot on the internet about how to install Scrapy, and I did everything, but I have a problem installing lxml. I tried pip install lxml in the terminal; a lot of things started downloading and a lot of text scrolled by in the terminal, but then I got this error message in red:

1 error generated.
error: command '/usr/bin/clang' failed with exit status 1
----------------------------------------
Cleaning up...
Command
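A clang failure like this usually means the compiler toolchain or the libxml2/libxslt headers are missing. A minimal sketch of the usual fixes, assuming Xcode's command line tools are not yet installed:

xcode-select --install
# let lxml download and build its own libxml2/libxslt statically
STATIC_DEPS=true pip install lxml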

In lxml, how do I remove a tag but retain all contents?

烈酒焚心 submitted on 2019-12-17 15:43:31
Question: The problem is this: I have an XML fragment like so:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>

For the result, I want to remove all <a>- and <c>-tags but retain their text contents and child nodes just as they are; the <b> element should be left untouched. The result should then look like this:

<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>

For the time being, I've resorted to a very dirty trick: I etree.tostring the fragment, remove
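The tostring round-trip isn't needed: lxml ships etree.strip_tags, which removes the named tags but splices their text and children back into the parent. A minimal sketch:

from lxml import etree

fragment = etree.fromstring(
    '<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>')
# remove <a> and <c> but keep their contents; <b> is untouched
etree.strip_tags(fragment, 'a', 'c')
print(etree.tostring(fragment))
# b'<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'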

How to search recursively for an XML tag using lxml?

会有一股神秘感。 submitted on 2019-12-17 15:35:48
Question: Given this document:

<?xml version="1.0" ?>
<data>
  <test>
    <f1/>
  </test>
  <test2>
    <test3>
      <f1/>
    </test3>
  </test2>
  <f1/>
</data>

Using lxml, is it possible to search recursively for the tag "f1"? I tried the findall method, but it works only for immediate children. I think I should go for BeautifulSoup for this!

Answer 1: You can use XPath to search recursively:

>>> from lxml import etree
>>> q = etree.fromstring('<xml><hello>a</hello><x><hello>b</hello></x></xml>')
>>> q.findall('hello')  # Tag name, first level only.
[
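The answer is cut off above; a minimal sketch of the recursive options on the question's own document:

from lxml import etree

doc = etree.fromstring(
    '<data><test><f1/></test><test2><test3><f1/></test3></test2><f1/></data>')
print(doc.findall('f1'))     # immediate children only: 1 match
print(doc.findall('.//f1'))  # './/' descends recursively: all 3 matches
print(list(doc.iter('f1')))  # iter() also walks the whole subtree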

Efficient way to iterate through xml elements

帅比萌擦擦* submitted on 2019-12-17 15:33:24
Question: I have XML like this:

<a>
  <b>hello</b>
  <b>world</b>
</a>
<x>
  <y></y>
</x>
<a>
  <b>first</b>
  <b>second</b>
  <b>third</b>
</a>

I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document, so I use XPath to handle that:

from lxml import etree

doc = etree.fromstring(xml)
atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print b

It works, but I have pretty big files, and cProfile shows me that xpath is very expensive to use. I
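For large documents, one common alternative (a sketch, not necessarily the answer this excerpt was cut from) is to stream the file with iterparse instead of building the whole tree at once; 'big.xml' is a placeholder filename:

from lxml import etree

# only <a> end-events are delivered; clearing each element after use
# keeps memory flat no matter how big the file is
for event, a in etree.iterparse('big.xml', tag='a'):
    for b in a.iterchildren(tag='b'):
        print(b.text)
    a.clear()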

How do I use empty namespaces in an lxml XPath query?

微笑、不失礼 submitted on 2019-12-17 10:59:56
Question: I have an XML document in the following format:

<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:gsa="http://schemas.google.com/gsa/2007">
  ...
  <entry>
    <id>https://ip.ad.dr.ess:8000/feeds/diagnostics/smb://ip.ad.dr.ess/path/to/file</id>
    <updated>2011-11-07T21:32:39.795Z</updated>
    <app:edited xmlns:app="http://purl.org/atom/app#">2011-11-07T21:32:39.795Z</app:edited>
    <link rel="self" type="application/atom+xml" href="https://ip.ad.dr.ess
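XPath has no notion of a default namespace, so the unprefixed Atom namespace has to be bound to an explicit prefix of your choosing in the namespaces mapping. A minimal sketch ('feed.xml' and the 'atom' prefix are placeholders):

from lxml import etree

doc = etree.parse('feed.xml')
ns = {'atom': 'http://www.w3.org/2005/Atom'}
for entry in doc.xpath('//atom:entry', namespaces=ns):
    print(entry.findtext('atom:id', namespaces=ns))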

Parse SGML with Open Arbitrary Tags in Python 3

你离开我真会死。 submitted on 2019-12-17 10:59:17
Question: I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml

I am using Python 3 and have been unable to find a solution with existing libraries for parsing an SGML file with open tags. SGML allows implicitly closed tags. When attempting to parse the example file with lxml, xml, or Beautiful Soup, I end up with implicitly closed tags being closed at the end of the file instead of at the end of the line. For example:

<COMPANY
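One rough workaround (a sketch under the assumption that, in the EDGAR header format, a tag with its value on the same line is implicitly closed at end of line) is to close those tags explicitly before handing the text to an XML parser:

import re

# matches a line like '<COMPANY>SOME NAME' and rewrites it as
# '<COMPANY>SOME NAME</COMPANY>'; container-only lines such as
# '<SEC-HEADER>' carry no value on the line and are left alone
LINE_TAG = re.compile(r'^(\s*)<([A-Z0-9-]+)>([^<\r\n]+?)\s*$', re.MULTILINE)

def close_implicit_tags(sgml: str) -> str:
    return LINE_TAG.sub(r'\1<\2>\3</\2>', sgml)

with open('0001181431-12-051484.hdr.sgml') as f:
    fixed = close_implicit_tags(f.read())
# 'fixed' can now be parsed with lxml or Beautiful Soup as ordinary markup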

How to use regular expression in lxml xpath?

我只是一个虾纸丫 submitted on 2019-12-17 07:13:42
Question: I'm using a construction like this:

doc = parse(url).getroot()
links = doc.xpath("//a[text()='some text']")

But I need to select all links whose text begins with "some text", so I'm wondering whether there is any way to use a regexp here. I didn't find anything in the lxml documentation.

Answer 1: You can do this (although you don't need regular expressions for this example). lxml supports regular expressions from the EXSLT extension functions (see the lxml docs for the XPath class, but it also works for the
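The answer is cut off above; a minimal sketch of both approaches, the EXSLT re:test function and the plain XPath starts-with that, as the answer notes, needs no regex at all:

from lxml import etree

doc = etree.fromstring('<div><a>some text here</a><a>other text</a></div>')
# the EXSLT regular-expressions namespace must be declared explicitly
ns = {'re': 'http://exslt.org/regular-expressions'}
print(doc.xpath("//a[re:test(text(), '^some text')]", namespaces=ns))
# for a simple prefix match, plain XPath is enough, no regexp needed
print(doc.xpath("//a[starts-with(text(), 'some text')]"))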

Python web scraping: the BeautifulSoup module

妖精的绣舞 submitted on 2019-12-17 05:32:39
1. Introduction

Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with your parser of choice, it gives you idiomatic ways to navigate, search, and modify the document, and it can save you hours or even days of work.

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, you can install lxml in any of the following ways:
$ apt-get install Python-lxml
$ easy_install lxml
$ pip install lxml

Another parser you can choose is html5lib, a pure-Python implementation that parses pages the same way a browser does. You can install html5lib in any of the following ways:
$ apt-get install Python-html5lib
$ easy_install html5lib
$ pip install html5lib

The table below lists the main parsers along with their advantages and disadvantages. The official documentation recommends lxml as the parser because it is more efficient. For versions before Python 2.7.3, and before 3.2.2 in the Python 3 series, you must install lxml or html5lib,
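A minimal usage sketch (the HTML snippet is made up for illustration; 'lxml' can be swapped for 'html.parser' if lxml is not installed):

from bs4 import BeautifulSoup

html = '<html><body><p class="title">hello</p><a href="/x">a link</a></body></html>'
soup = BeautifulSoup(html, 'lxml')
print(soup.p.text)          # hello
print(soup.a['href'])       # /x
print(soup.find_all('a'))   # every <a> tag in the document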

BeautifulSoup: study notes

两盒软妹~` submitted on 2019-12-17 03:27:06
BeautifulSoup is a Python library for extracting data from HTML or XML files. Compared with regular expressions it is much simpler and more convenient to use, and it can often save us a great deal of time. Official Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Some notes follow.

Available parsers

The main parsers are:

Parser: Python standard library
  Usage: BeautifulSoup(markup, "html.parser")
  Advantages: moderate speed; built into Python; tolerant of malformed documents
  Disadvantages: poor error tolerance in versions before Python 2.7.3 and 3.2.2

Parser: lxml HTML parser
  Usage: BeautifulSoup(markup, "lxml")
  Advantages: fast; tolerant of malformed documents
  Disadvantages: requires the C library to be installed

Parser: lxml XML parser
  Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
  Advantages: fast; the only parser that supports XML
  Disadvantages: requires the C library to be installed

Parser: html5lib
  Usage: BeautifulSoup(markup, "html5lib")
  Advantages: best error tolerance; parses documents the way a browser does; generates HTML5-format documents
  Disadvantages: slow; does not rely on external extensions

Sometimes lxml needs to be installed separately:
pip install
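A small sketch of how the "Usage" column above translates into code, and of how the parsers differ on broken markup (assumes lxml and html5lib are installed alongside the standard library):

from bs4 import BeautifulSoup

markup = '<a><b>unclosed'  # deliberately broken HTML
for parser in ('html.parser', 'lxml', 'html5lib'):
    # each parser repairs the missing close tags in its own way
    print(parser, BeautifulSoup(markup, parser))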