lxml | 易学教程

How do I match contents of an element in XPath (lxml)?

Question: I want to parse HTML with lxml using XPath expressions. My problem is matching the contents of a tag. For example, given the <a href="http://something">Example</a> element, I can match the href attribute using .//a[@href='http://something'], but given the expression .//a[.='Example'] or even .//a[contains(.,'Example')], lxml throws the 'invalid node predicate' exception. What am I doing wrong? EDIT: Example code: from lxml import etree from cStringIO import StringIO html = '<a href="http:
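A minimal sketch of one common workaround, assuming the exception comes from `find()` / `findall()`, which only accept the limited ElementPath subset. The `xpath()` method evaluates full XPath 1.0, so text predicates are available there. The second link and the variable names are illustrative and not from the question.

```python
from lxml import etree

# Sample markup: the first link is the one from the question,
# the second is a made-up distractor.
html = '<p><a href="http://something">Example</a> <a href="http://other">Other</a></p>'
root = etree.HTML(html)

# xpath() supports full XPath 1.0, including predicates on text content.
by_href = root.xpath(".//a[@href='http://something']")
by_text = root.xpath(".//a[.='Example']")
by_substring = root.xpath(".//a[contains(., 'Examp')]")

print([a.get("href") for a in by_text])
```

All three expressions select the same link here; `contains()` is the looser match and also accepts partial text.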

Remove all javascript tags and style tags from html with python and the lxml module

Question: I am parsing an HTML document using the http://lxml.de/ library. So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post leaves all the text, stripping the tags without removing the actual script. I have also found a class reference for lxml.html.clean.Cleaner, http://lxml.de/api/lxml.html.clean.Cleaner-class.html, but this is clear as mud as to how to actually use the class to clean the
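A hedged sketch of one way to drop script and style elements together with their contents. It does not use the `Cleaner` class the question mentions; it swaps in `etree.strip_elements()`, which removes the named elements and everything inside them. The sample markup is invented for illustration.

```python
from lxml import etree, html

# Invented sample page with one <style> and two <script> elements.
page = (
    "<html><head><style>p{color:red}</style><script>alert(1)</script></head>"
    "<body><p>Hello</p><script>var x = 1;</script> world</body></html>"
)
doc = html.fromstring(page)

# Remove the elements and their contents; with_tail=False keeps the text
# that follows each removed element (here " world").
etree.strip_elements(doc, "script", "style", with_tail=False)

cleaned = html.tostring(doc, encoding="unicode")
text = doc.text_content()
print(cleaned)
```

By contrast, `etree.strip_tags()` removes only the tags and keeps their text, which is the behaviour the question is trying to avoid.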

Entity references and lxml

Question: Here's the code I have: from cStringIO import StringIO from lxml import etree xml = StringIO('''<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE root [ <!ENTITY test "This is a test"> ]> <root> <sub>&test;</sub> </root>''') d1 = etree.parse(xml) print '%r' % d1.find('/sub').text parser = etree.XMLParser(resolve_entities=False) d2 = etree.parse(xml, parser=parser) print '%r' % d2.find('/sub').text Here's the output: 'This is a test' None How do I get lxml to give me '&test;' , i.e., the raw
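A small Python 3 sketch of what happens here, under the assumption that serializing the element is an acceptable way to see the raw reference. With `resolve_entities=False` the reference is kept as a separate entity node inside `<sub>`, so `.text` is `None`, but the serialized form still shows `&test;`. The variable names are illustrative.

```python
from lxml import etree

# Same document as in the question, as bytes because it carries an
# encoding declaration.
XML = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [ <!ENTITY test "This is a test"> ]>
<root> <sub>&test;</sub> </root>'''

# Default parser: the internal entity is expanded into plain text.
resolved = etree.fromstring(XML)
print(repr(resolved.find("sub").text))

# Entities left unresolved: <sub> holds an entity node, so .text is None.
parser = etree.XMLParser(resolve_entities=False)
unresolved = etree.fromstring(XML, parser)
sub = unresolved.find("sub")
print(repr(sub.text))

# Serializing the element shows the raw reference.
raw = etree.tostring(sub, encoding="unicode", with_tail=False)
print(raw)
```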

pip install lxml mysql-python error

Problem 0: When installing mysql-python, the following error appears:

sh: mysql_config: not found
Traceback (most recent call last):
  File "setup.py", line 15, in <module>
    metadata, options = get_config()
  File "/home/zhxia/apps/source/MySQL-python-1.2.3/setup_posix.py", line 43, in get_config
    libs = mysql_config("libs_r")
  File "/home/zhxia/apps/source/MySQL-python-1.2.3/setup_posix.py", line 24, in mysql_config
    raise EnvironmentError("%s not found" % (mysql_config.path,))
EnvironmentError: mysql_config not found

The main cause is that libmysqlclient-dev is not installed. Install it:

sudo apt-get install libmysqlclient-dev

Then find the path of the mysql_config file:

sudo updatedb
locate mysql_config
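As a rough pre-flight check, the lookup that fails in the traceback can be reproduced from Python before running pip. This is only a sketch: `find_build_helper` is a hypothetical helper name, and it simply asks whether `mysql_config` is on the current `PATH`.

```python
import shutil


def find_build_helper(name="mysql_config"):
    """Return the full path of a build helper found on PATH, or None."""
    return shutil.which(name)


path = find_build_helper()
if path is None:
    # Same situation as "sh: mysql_config: not found" in the log above.
    print("mysql_config not found; install libmysqlclient-dev first")
else:
    print("mysql_config found at", path)
```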

lxml: add namespace to input file

Question: I am parsing an XML file generated by an external program. I would then like to add custom annotations to this file, using my own namespace. My input looks as below: <sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4"> <model metaid="untitled" id="untitled"> <annotation>...</annotation> <listOfUnitDefinitions>...</listOfUnitDefinitions> <listOfCompartments>...</listOfCompartments> <listOfSpecies> <species
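A minimal sketch of adding an element in a custom namespace, assuming a heavily shortened version of the input above. The namespace URI `http://example.org/my-annotations`, the prefix `myns`, and the element name `note` are hypothetical placeholders.

```python
from lxml import etree

SBML_NS = "http://www.sbml.org/sbml/level2/version4"
MY_NS = "http://example.org/my-annotations"  # hypothetical namespace URI

# Shortened stand-in for the SBML input shown in the question.
doc = etree.fromstring(
    '<sbml xmlns="%s" level="2" version="4">'
    '<model metaid="untitled" id="untitled"><annotation/></model></sbml>' % SBML_NS
)

annotation = doc.find("{%s}model/{%s}annotation" % (SBML_NS, SBML_NS))

# nsmap on the new element supplies the prefix used when serializing.
note = etree.SubElement(annotation, "{%s}note" % MY_NS, nsmap={"myns": MY_NS})
note.text = "custom annotation"

serialized = etree.tostring(doc, pretty_print=True, encoding="unicode")
print(serialized)
```

In this sketch the `xmlns:myns` declaration appears on the new element itself rather than on the root, because the root's namespace map is fixed once the document has been parsed.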

libxml install error using pip

Question: This is my error: (mysite)zjm1126@zjm1126-G41MT-S2:~/zjm_test/mysite$ pip install lxml Downloading/unpacking lxml Running setup.py egg_info for package lxml Building lxml version 2.3. Building without Cython. ERROR: /bin/sh: xslt-config: not found ** make sure the development packages of libxml2 and libxslt are installed ** Using build configuration of libxslt Installing collected packages: lxml Running setup.py install for lxml Building lxml version 2.3. Building without Cython. ERROR: /bin
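The log itself points at the missing development packages of libxml2 and libxslt (on Debian or Ubuntu these are typically named `libxml2-dev` and `libxslt1-dev`, which is an assumption about the asker's system). Once the build succeeds, a quick sanity check is to print the versions lxml was built against:

```python
from lxml import etree

# Version tuples exposed by lxml.etree: lxml itself and the C libraries
# it is running against.
print("lxml:   ", etree.LXML_VERSION)
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```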

Python web scraping: using lxml etree together with XPath

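This article shows how to use lxml etree together with XPath in Python web scraping, with worked examples. lxml is an HTML/XML parser for Python; the official documentation is at https://lxml.de/. The lxml package must be installed before use. In a terminal (cmd on Windows 7, 8, or 10), enter:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml

Features:
1. Parse HTML: use etree.HTML(text) to parse an HTML fragment given as a string into an HTML document.
2. Read XML files.
3. Use etree together with XPath (the main topic of this article).

Example: using etree together with XPath

# read a file with lxml etree
from lxml import etree
xml = etree.parse("./py24.xml")
print(type(xml))
# find all book nodes
rst = xml.xpath('//book')
print(type(rst))
print(rst)
# find elements whose category attribute is "sport"
rst2 = xml.xpath('//book[@category="sport"]')
print(type(rst2))
print(rst2)
#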
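The file `./py24.xml` is not included with the article, so the sketch below runs the same two queries against a small inline document. The bookstore content is invented and only assumes that the original file contains `book` elements with a `category` attribute.

```python
from io import BytesIO

from lxml import etree

# Invented stand-in for ./py24.xml.
XML = b"""<bookstore>
  <book category="sport"><title>Basketball</title></book>
  <book category="cooking"><title>Soup</title></book>
</bookstore>"""

# etree.parse() accepts a file-like object and returns an ElementTree.
xml = etree.parse(BytesIO(XML))

# All book nodes; xpath() returns a plain Python list of elements.
rst = xml.xpath("//book")
print(type(rst), len(rst))

# Only the books whose category attribute equals "sport".
rst2 = xml.xpath('//book[@category="sport"]')
print([book.findtext("title") for book in rst2])
```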

Pip install failed in openshift 3

Question: I want to use the new OpenShift 3 platform, but I can't install lxml for Weblate with pip when the build process is launched. The last line in the logs is "Running setup.py install for lxml", with no further error. How can I find out what happened? Thanks. Answer 1: Some packages around data analytics, when compiled with compiler optimisations, can use too much memory and hit the default memory limit for builds. Try following the steps outlined in: Pandas on OpenShift v3. Is less likely, but just in case is the

Replace `\n` in html page with space in python LXML

Question: I have a messy XML file and process it with the Python lxml module. I want to replace every \n in the content with a space before any other processing. How can I do this for the text of all elements? Edit: my XML example: <root> <a> dsdfs\n dsf\n sdf\n</a> <bds> <d>sdf\n\n\n\n\n\n</d> <d>sdf\n\n\nsdf\nsdf\n\n</d> </bds> .... .... .... .... </root> and I want to get this output when I print itertext: root = #get root element for i in root.itertext(): print i dsdfs dsf sdf dsdfs dsf sdf sdf nsdf sdf Answer 1: Below
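One possible approach, sketched under the assumption that the `\n` in the question stands for real newline characters: walk every node with `iter()` and rewrite both `.text` and `.tail`, since text that follows a child element is stored in that child's tail. The sample document is a shortened version of the one in the question.

```python
from lxml import etree

# Shortened sample; "\n" here is an actual newline character.
root = etree.fromstring(
    "<root><a> dsdfs\n dsf\n sdf\n</a>"
    "<bds><d>sdf\n\n\n</d><d>sdf\n\nsdf\nsdf\n</d></bds></root>"
)

for element in root.iter():
    # Text directly inside the element, before its first child.
    if element.text:
        element.text = element.text.replace("\n", " ")
    # Text after the element's closing tag, up to the next sibling.
    if element.tail:
        element.tail = element.tail.replace("\n", " ")

for chunk in root.itertext():
    print(repr(chunk))
```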

Generate xml documents using lxml and vary element text and attributes based on logic

Question: I have my lxml code like this: from lxml import etree import sys fd = open('D:\\text.xml', 'wb') xmlns = "http://www.fpml.org/FpML-5/confirmation" xsi = "http://www.w3.org/2001/XMLSchema-instance" fpmlVersion="http://www.fpml.org/FpML-5/confirmation ../../fpml-main-5-6.xsd http://www.w3.org/2000/09/xmldsig# ../../xmldsig-core-schema.xsd" page = etree.Element("{"+xmlns+"}dataDocument",nsmap={None:xmlns,'xsi':xsi }) doc = etree.ElementTree(page) page.set("fpmlVersion", fpmlVersion) trade = etree
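The question is cut off, so the following is only a sketch of the general idea in the title: build elements in a loop and let ordinary Python conditions decide text and attributes. The namespaces are the ones from the question; `build_document`, the `trade` and `amount` element layout, the `size` attribute, and the input records are all hypothetical.

```python
from lxml import etree

XMLNS = "http://www.fpml.org/FpML-5/confirmation"
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def build_document(trades):
    """Build a dataDocument tree from a list of dicts (hypothetical layout)."""
    page = etree.Element("{%s}dataDocument" % XMLNS, nsmap={None: XMLNS, "xsi": XSI})
    for record in trades:
        trade = etree.SubElement(page, "{%s}trade" % XMLNS)
        trade.set("id", record["id"])
        amount = etree.SubElement(trade, "{%s}amount" % XMLNS)
        amount.text = str(record["amount"])
        # Attribute that is only set when a condition holds.
        if record["amount"] >= 1000000:
            amount.set("size", "large")
    return etree.ElementTree(page)


doc = build_document([{"id": "T1", "amount": 500}, {"id": "T2", "amount": 2500000}])
xml_bytes = etree.tostring(doc, pretty_print=True, xml_declaration=True, encoding="UTF-8")
print(xml_bytes.decode("utf-8"))
```

Keeping the varying data in plain dictionaries separates the decision logic from the tree-building calls, which makes it easier to generate many documents from different inputs.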