lxml

Python Web Scraping: The BeautifulSoup Library

Introduction to Beautiful Soup

Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description makes three points:

1. Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you need to extract; because it is simple, a complete application takes very little code.
2. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't have to think about encodings, unless the document doesn't declare one; in that case Beautiful Soup cannot detect the encoding automatically, and you only need to state the original encoding yourself.
3. Beautiful Soup works with excellent Python parsers such as lxml and html5lib, letting you flexibly choose between different parsing strategies or raw speed.

Pros and cons of the parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers; if no third-party parser is installed, Python falls back to its default built-in parser. The lxml parser is more powerful and faster, so installing it is recommended (pip install lxml).

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Python's built-in standard library; moderate execution speed |
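A minimal usage sketch of the recommended setup, assuming beautifulsoup4 and lxml are installed; the HTML string is purely illustrative:

```python
# Minimal sketch: parse HTML with Beautiful Soup using the lxml parser.
# Assumes `pip install beautifulsoup4 lxml`; the markup below is made up.
from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello</p><a href='/next'>next</a></body></html>"
soup = BeautifulSoup(html, "lxml")   # or "html.parser" to use the stdlib parser

print(soup.p.get_text())             # "Hello"
print(soup.find("a")["href"])        # "/next"
```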

How to get an attribute of an Element that is namespaced

I'm parsing an XML document that I receive from a vendor every day, and it uses namespaces heavily. I've reduced the problem to a minimal subset here: there are some elements I need to parse, all of which are children of an element carrying a specific attribute. I am able to use lxml.etree.Element.findall(TAG, root.nsmap) to find the candidate nodes whose attribute I need to check. I'm then trying to read the attribute of each of these elements by the name I know it uses, which concretely here is ss:Name. If the value of that attribute is the desired value, I'm going to dive deeper into
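lxml keys namespaced attributes by Clark notation ({namespace-uri}localname) rather than by prefix, so a sketch of the lookup might look like this; the namespace URI and element names are assumptions (the SpreadsheetML vocabulary that conventionally uses the ss prefix):

```python
# Sketch: reading a namespaced attribute with lxml's Clark notation.
# The namespace URI and tag names are assumptions (SpreadsheetML-style).
from lxml import etree

SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"

root = etree.parse("vendor.xml").getroot()
for elem in root.findall(".//{%s}Worksheet" % SS_NS):
    # Attributes are keyed as "{uri}localname", not by the "ss:" prefix.
    name = elem.get("{%s}Name" % SS_NS)
    if name == "DesiredValue":
        pass  # dive deeper into this element
```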

How to set up XPath query for HTML parsing?

Here is some HTML from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0, as rendered in Google Chrome, that I want to parse for a project:

```html
<div id="names">
  <h2>Names and Synonyms</h2>
  <div class="ds"><button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">↔</button>
    <h3 id="yui_3_18_1_3_1434394159641_407">Name of Substance</h3>
    <ul>
      <li id="ds2">
        <div>Acetaldehyde</div>
      </li>
    </ul>
  </div>
```

I wrote a python script to help me do such a
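One way to target the substance name is to anchor on the stable heading text rather than the auto-generated yui_… id; a sketch, assuming the page structure shown above:

```python
# Sketch: pull "Acetaldehyde" out of the snippet above with an XPath that
# anchors on the <h3> text instead of the volatile yui_… id attribute.
from lxml import html

tree = html.parse("chemidplus.html")  # or html.fromstring(page_source)
names = tree.xpath(
    "//div[@id='names']//h3[normalize-space()='Name of Substance']"
    "/following-sibling::ul[1]/li/div/text()"
)
print(names)  # ['Acetaldehyde']
```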

Python lxml/beautiful soup to find all links on a web page

I am writing a script to read a web page and build a database of links that match certain criteria. Right now I am stuck with lxml and understanding how to grab all the <a href>s from the HTML:

```python
result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
print lxml.html.find_rel_links(html,'href')
```

Answer: Use XPath. Something like (can't test from here):

```python
urls = html.xpath('//a/@href')
```

Answer (Gregory Petukhov): With iterlinks, lxml provides an excellent function for this task. This yields (element, attribute, link, pos) for every link [...] in an action, archive,
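A sketch of the iterlinks approach, which also covers links outside href attributes (src, action, and so on); the URL is a placeholder:

```python
# Sketch: enumerate every link lxml.html knows about via iterlinks(),
# which yields (element, attribute, link, pos) tuples.
import lxml.html

doc = lxml.html.parse("http://example.com").getroot()
doc.make_links_absolute("http://example.com")  # resolve relative URLs

for element, attribute, link, pos in doc.iterlinks():
    if element.tag == "a" and attribute == "href":
        print(link)
```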

lxml with schema 1.1

I'm trying to use lxml with the xs:assert validation tag. I've tried using the example from this IBM page: http://www.ibm.com/developerworks/library/x-xml11pt2/

```xml
<xs:element name="dimension">
  <xs:complexType>
    <xs:attribute name="height" type="xs:int"/>
    <xs:attribute name="width" type="xs:int"/>
    <xs:assert test="@height < @width"/>
  </xs:complexType>
</xs:element>
```

It seems like lxml doesn't support XML Schema 1.1. Can someone confirm this? Which XML engine (for Python) does support Schema 1.1?
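lxml is built on libxml2, which implements XML Schema 1.0 only, so a schema using the XSD 1.1 xs:assert element fails at compile time rather than at validation time; a minimal sketch of what that looks like:

```python
# Sketch: lxml (libxml2) implements XSD 1.0 only, so a schema containing
# the XSD 1.1 <xs:assert> element fails with XMLSchemaParseError.
from lxml import etree

schema_doc = etree.parse("dimension.xsd")  # the schema quoted above
try:
    schema = etree.XMLSchema(schema_doc)
except etree.XMLSchemaParseError as e:
    print("Schema rejected:", e)
```

For actual XSD 1.1 support from Python, the pure-Python xmlschema package advertises it through its XMLSchema11 class, though that is outside lxml.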

XPath predicate with sub-paths with lxml?

I'm trying to understand an XPath that was sent to me for use with ACORD XML forms (a common format in insurance). The XPath they sent me is (truncated for brevity):

```
./PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo
```

Where I'm running into trouble is that Python's lxml library is telling me that [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"] is an invalid predicate. I'm not able to find anywhere in the XPath spec on predicates which identifies this syntax, so that I can modify this predicate to work. Is there any documentation
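That predicate is valid XPath 1.0; the error is characteristic of Element.findall(), which accepts only ElementPath's limited subset and rejects sub-paths inside predicates. A sketch of the usual workaround, switching to lxml's full XPath engine via .xpath():

```python
# Sketch: findall() uses ElementPath's restricted syntax, which rejects
# sub-paths in predicates; .xpath() runs full XPath 1.0 and accepts them.
from lxml import etree

root = etree.parse("acord.xml").getroot()
matches = root.xpath(
    './PersApplicationInfo/InsuredOrPrincipal'
    '[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]'
    '/GeneralPartyInfo'
)
# (Assumes no default namespace on the document; otherwise .xpath()
#  additionally needs a namespaces={...} mapping and prefixed names.)
```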

Import error for lxml in python

I wrote a script some time ago that contains

```python
from lxml import etree
```

but unfortunately it is not working anymore. Just in case, I checked the installation with:

```
sudo apt-get install python-lxml
sudo pip install lxml
sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev
```

I checked whether it could be my Python version with:

```
me@pc:~$ python
Python 2.7.3 (default, Sep 14 2012, 14:11:57)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
Traceback (most recent call last):
  File "<stdin>", line 1, in
```
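When an apt-installed python-lxml and a pip-installed lxml coexist, the interpreter can pick up a stale or broken copy; a quick diagnostic sketch that locates the package without running its __init__ (the printed path will differ per machine):

```python
# Sketch: find which lxml copy the interpreter would load, without
# importing it (useful when "import lxml" itself is what fails).
# Helps spot an apt (python-lxml) install shadowing the pip one.
import imp  # Python 2, matching the session shown above

f, pathname, desc = imp.find_module("lxml")
print("lxml resolves to: " + pathname)  # e.g. a dist-packages directory
```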

lxml html5parser ignores “namespaceHTMLElements=False” option

The lxml html5parser seems to ignore any namespaceHTMLElements=False option I pass to it. It puts all elements I give it into the HTML namespace instead of the (expected) void namespace. Here's a simple case that reproduces the problem:

```
echo "<p>" | python -c "from sys import stdin; \
  from lxml.html import html5parser as h5, tostring; \
  print tostring(h5.parse(stdin, h5.HTMLParser(namespaceHTMLElements=False)))"
```

The output from that is this:

```
<html:html xmlns:html="http://www.w3.org/1999/xhtml
```
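One likely culprit, offered as an assumption: html5parser.parse() takes the parser as a keyword argument after guess_charset, so a positionally passed parser is silently consumed by the wrong parameter and the default (namespacing) parser is used instead. Passing parser= explicitly may behave as expected:

```python
# Sketch: pass the parser by keyword. html5parser.parse's second
# positional parameter is guess_charset, so a positionally passed parser
# never reaches the parser argument and the namespacing default is used.
from sys import stdin
from lxml.html import html5parser as h5, tostring

parser = h5.HTMLParser(namespaceHTMLElements=False)
doc = h5.parse(stdin, parser=parser)  # note the keyword, not positional
print(tostring(doc))
```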

Parsing large XML file with Python - etree.parse error

Trying to parse the following XML file using the lxml.etree.iterparse function. "sampleoutput.xml":

```xml
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
```

I tried the code from "Parsing Large XML file with Python lxml and Iterparse". Before the etree.iterparse(MYFILE) call I did

```python
MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml", "r")
```

But it turns up the following error:

```
Traceback (most recent call last):
  File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in
```
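A likely cause, sketched here as an assumption since the traceback is cut off: the sample has two top-level <item> elements, and well-formed XML needs exactly one root element, so the parser raises a syntax error when it reaches the second <item>. Wrapping the items in a single root lets iterparse work:

```python
# Sketch: iterparse requires well-formed XML with a single root element.
# Assumes sampleoutput.xml has been rewrapped as:
#   <items>
#     <item><title>Item 1</title><desc>Description 1</desc></item>
#     <item><title>Item 2</title><desc>Description 2</desc></item>
#   </items>
from lxml import etree

for event, elem in etree.iterparse("sampleoutput.xml", tag="item"):
    print(elem.findtext("title"), "-", elem.findtext("desc"))
    elem.clear()  # free each element as we go, the point of iterparse
```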

Using Python and lxml to strip only the tags that have certain attributes/values

I'm familiar with etree's strip_tags and strip_elements functions, but I'm looking for a straightforward way of stripping tags (while leaving their contents) that have particular attributes/values. For instance: I'd like to strip all span or div tags (or other elements) from an (X)HTML tree that have a class='myclass' attribute/value, preserving the element's contents the way strip_tags would. Meanwhile, those same elements that don't have class='myclass' should remain untouched. Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs
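lxml.html elements have a drop_tag() method that removes the tag while splicing its text and children into the parent, which pairs naturally with an attribute-filtering XPath; a sketch with made-up markup:

```python
# Sketch: strip only the <span>/<div> elements carrying class="myclass",
# keeping their contents, via lxml.html's drop_tag().
import lxml.html

doc = lxml.html.fromstring(
    '<p>keep <span class="myclass">this text</span> and '
    '<span class="other">this span</span></p>'
)

for el in doc.xpath('//span[@class="myclass"] | //div[@class="myclass"]'):
    el.drop_tag()  # removes the tag, keeps its text/children in place

print(lxml.html.tostring(doc))
# b'<p>keep this text and <span class="other">this span</span></p>'
```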