lxml | 易学教程

xmlns namespace breaking lxml

Question: I am trying to open an XML file and get values from certain tags. I have done this many times, but this particular XML file is giving me trouble. Here is a section of the file:

    <?xml version='1.0' encoding='UTF-8'?>
    <package xmlns="http://apple.com/itunes/importer" version="film4.7">
      <provider>filmgroup</provider>
      <language>en-GB</language>
      <actor name="John Smith" display="Doe John"></actor>
    </package>

And here is a sample of my Python code:

    metadata = '/Users/mylaptop/Desktop/Python/metadata
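The excerpt cuts off before the asker's parsing code, so the exact failure is unknown. A common cause with documents like this one is the default namespace declared by `xmlns`: every element belongs to that namespace, and an unqualified tag name no longer matches. The following is a minimal sketch of that behaviour, not the asker's code. It uses the standard library's `xml.etree.ElementTree`; `lxml.etree` accepts the same `find` calls. The prefix `it` and the variable names are arbitrary choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document based on the question (the actor element is closed properly here).
XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>"""

# "it" is an arbitrary prefix chosen for this sketch; only the URI has to match.
NS = {"it": "http://apple.com/itunes/importer"}

root = ET.fromstring(XML)

# An unqualified name finds nothing, because <provider> lives in the default namespace.
unqualified = root.find("provider")

# Option 1: map a prefix to the namespace URI and use it in the path.
provider = root.find("it:provider", NS).text

# Option 2: spell out the namespace in Clark notation, {uri}tag.
language = root.find("{http://apple.com/itunes/importer}language").text

# Attributes without a prefix are not namespaced, so plain names work for them.
actor_name = root.find("it:actor", NS).get("name")

print(unqualified, provider, language, actor_name)
```

Passing a prefix map keeps paths short when many lookups are needed, while Clark notation avoids the extra dictionary for a one-off lookup.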

Import error for lxml in python

Question: Some time ago I wrote a script that contains

    from lxml import etree

Unfortunately, it no longer works. To rule out a broken installation, I ran:

    sudo apt-get install python-lxml
    sudo pip install lxml
    sudo apt-get install libxml2-dev
    sudo apt-get install libxslt1-dev

I also checked whether my Python version could be the cause:

    me@pc:~$ python
    Python 2.7.3 (default, Sep 14 2012, 14:11:57)
    [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
    Type "help", "copyright", "credits" or "license"

How to remove brackets from python string?

Question: I know the title might make this look like a duplicate, but it's not.

    for id,row in enumerate(rows):
        columns = row.findall("td")
        teamName = columns[0].find("a").text, # Lag
        playedGames = columns[1].text, # S
        wins = columns[2].text,
        draw = columns[3].text,
        lost = columns[4].text,
        dif = columns[6].text, # GM-IM
        points = columns[7].text, # P - last column
        dict[divisionName].update({id :{"teamName":teamName, "playedGames":playedGames, "wins":wins, "draw":draw, "lost":lost, "dif":dif,
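The excerpt ends before the asker shows any output, so the diagnosis here is an assumption: each assignment in the loop ends with a trailing comma, and in Python a trailing comma after an expression builds a one-element tuple, which prints with parentheses around the value. A minimal sketch with made-up values:

```python
# A trailing comma turns the right-hand side into a one-element tuple.
wins = "3",   # tuple
draw = "1"    # plain string

print(repr(wins))
print(repr(draw))

# Either drop the trailing comma at the assignment, or take the first item.
fixed = wins[0]
print(repr(fixed))
```

Removing the commas at the point of assignment is usually the cleaner fix, since it keeps the stored values as plain strings everywhere they are used later.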

using xpath to select an element after another

Question: I've seen similar questions, but the solutions I found don't work on the following. I'm far from an XPath expert; I just need to parse some HTML. How can I select the table that follows Header 2? I thought my solution below should work, but apparently it does not. Can anyone help me out?

    content = """<div>
    <p><b>Header 1</b></p>
    <p><b>Header 2</b><br></p>
    <table>
    <tr>
    <td>Something</td>
    </tr>
    </table>
    </div>
    """
    from lxml import etree
    tree = etree.HTML(content)
    tree.xpath("//table/following::p
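The asker's expression is truncated, but it starts from the table and looks forward for a paragraph, which is the reverse of what the question asks for. One possible approach, sketched here under the assumption that the header paragraph and the table are siblings as in the sample, is to locate the `p` that contains "Header 2" and then step to its first following `table` sibling. This requires lxml to be installed; the variable names are illustrative.

```python
from lxml import etree

content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""

tree = etree.HTML(content)

# Start at the <p> whose <b> child reads "Header 2", then take the
# nearest <table> among the siblings that come after it.
tables = tree.xpath('//p[b="Header 2"]/following-sibling::table[1]')

# Read the cell text from the selected table.
cell_text = tables[0].findtext(".//td")
print(len(tables), cell_text)
```

If the table were nested deeper or not a direct sibling, the `following::` axis could be used instead of `following-sibling::`, at the cost of matching tables anywhere later in the document.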

Scrapy framework installation and configuration summary

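Windows platform: the system is Win7 with Python 2.7.7. Official documentation: http://doc.scrapy.org/en/latest/intro/install.html

1. Install Python. Install Python 2.7.7, then configure the environment variables. For example, mine is installed on drive D at D:\python2.7.7, so add the following two paths to the Path variable:

    D:\python2.7.7;D:\python2.7.7\Scripts

Once that is configured, enter python --version at the command line; if no error is reported, the installation succeeded.

2. Install pywin32. On Windows, pywin32 must be installed. Go to http://sourceforge.net/projects/pywin32/files/ and choose the matching version (it must match the installed Python version). After downloading, double-click the installer and click Next all the way through. To verify afterwards, enter import win32com at the Python prompt; if no error is reported, the installation succeeded.

3. Install pip. pip is the tool used to install the other required packages. First download get-pip.py; then, from the directory containing that file, run the following command:

    python get-pip.py

Running this command installs pip, and at the same time it also installs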

Fast and effective way to parse broken HTML?

Question: I'm working on large projects that require fast HTML parsing, including recovery for broken HTML pages. Currently lxml is my choice. I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces much better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, which has a broken <header> tag that lxml/libxml2 couldn't correct). However, the

Getting non-contiguous text with lxml / ElementTree

Question: Suppose I have this sort of HTML, from which I need to select "text2" using lxml / ElementTree:

    <div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>

If I already have the div element as mydiv, then mydiv.text returns just "text1". Using itertext() seems problematic, or cumbersome at best, since it walks the entire tree under the div. Is there a simple, elegant way to extract a non-first text chunk from an element?

Answer 1: Well, lxml.etree provides full XPath support, which
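The answer excerpt is cut off, so what follows is not the answerer's XPath approach but an alternative sketch based on the `tail` attribute. In both ElementTree and lxml, the text that comes directly after an element's closing tag is stored on that element as `tail`, so "text2" is the tail of the first `span`. The example uses the standard library's `xml.etree.ElementTree`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

mydiv = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

# The text right after </span> belongs to that span as its tail.
first_span = mydiv[0]
text2 = first_span.tail

# All direct text chunks of the div, without descending into children:
# the div's own leading text, followed by each child's tail.
chunks = [mydiv.text] + [child.tail for child in mydiv]

print(text2)
print(chunks)
```

With lxml specifically, `mydiv.xpath("text()")` returns the same direct text chunks in a single call, which is probably closer to what the truncated answer goes on to describe.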

Cannot install lxml on windows, fatal error C1083: Cannot open include file: 'libxml/xmlversion.h'

Question: Python noob here, please bear with me. I installed Python v3.5.1 using the installer from www.python.org. My intent was to use Scrapy to run some scripts. pip install scrapy failed, as did easy_install scrapy and others. I traced the error to a faulty install of lxml. Here is the error log. I've even tried easy_installing libxml2, and I'm not sure how to proceed.

    Building lxml version 3.5.0.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program

Valid XPath expression

Question: Just two questions: How can I check whether the string assigned to a variable is a valid XPath expression? How can I return a customized error message when the requested resource does not exist?

Answer 1: If the XPath is invalid, you'll get an exception. If the requested node does not exist, you'll get an empty result set. For example:

    from lxml import etree
    from StringIO import StringIO
    tree = etree.parse(StringIO('<foo><bar></bar></foo>'))
    try:
        tree.xpath('\BAD XPATH')
        print '1. Valid
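The answer's code is written for Python 2 and is cut off mid-line. The following Python 3 sketch illustrates the same two cases the answer describes: an invalid expression raises an exception, and a valid expression that matches nothing returns an empty list. The helper names `is_valid_xpath` and `select`, the error messages, and the choice of exception types raised to the caller are all assumptions made for this example, not part of lxml or of the original answer. It requires lxml to be installed.

```python
from lxml import etree


def is_valid_xpath(expression):
    """Return True if lxml can compile the expression (hypothetical helper)."""
    try:
        etree.XPath(expression)
    except etree.XPathError:
        return False
    return True


def select(tree, expression):
    """Evaluate a node-selecting XPath, with custom messages for both failure cases."""
    if not is_valid_xpath(expression):
        raise ValueError("Invalid XPath expression: %r" % expression)
    result = tree.xpath(expression)
    if result == []:
        raise LookupError("Nothing matched: %r" % expression)
    return result


tree = etree.fromstring("<foo><bar></bar></foo>")

print(is_valid_xpath("//bar"))   # compiles
print(is_valid_xpath("//bar["))  # unterminated predicate

try:
    select(tree, "//baz")        # valid syntax, but no such element
except LookupError as exc:
    message = str(exc)
    print(message)
```

Compiling with `etree.XPath` checks the syntax without needing a document, so the validity test can run before any file is parsed.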
