lxml

lxml: cssselect(): AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'

旧时模样 submitted on 2019-12-06 13:41:44
Can someone explain why the first call to root.cssselect() works, while the second fails?

    from lxml.html import fromstring
    from lxml import etree

    html = '<html><a href="http://example.com">example</a></html'
    root = fromstring(html)
    print 'via fromstring', repr(root)   # via fromstring <Element html at 0x...>
    print root.cssselect("a")

    root2 = etree.HTML(html)
    print 'via etree.HTML()', repr(root2)   # via etree.HTML() <Element html at 0x...>
    root2.cssselect("a")   # --> Exception

I get:

    Traceback (most recent call last):
      File "/home/foo_eins_d/src/foo.py", line 11, in <module>
        root2.cssselect("a")
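The difference is that lxml.html.fromstring() returns an HtmlElement, which has a cssselect() method, while etree.HTML() returns a plain _Element, which does not. A minimal sketch of one workaround, assuming the separate cssselect package is installed, is to compile the selector explicitly and apply it to the plain element:

    from lxml import etree
    from lxml.cssselect import CSSSelector   # needs the cssselect package with lxml >= 3.0

    html = '<html><a href="http://example.com">example</a></html>'
    root2 = etree.HTML(html)

    select_links = CSSSelector("a")   # compile the CSS expression once
    print(select_links(root2))        # list of matching <a> elements

Alternatively, parsing with lxml.html in the first place keeps cssselect() available on every element.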

Extract value from element when second namespace is used in lxml

一曲冷凌霜 submitted on 2019-12-06 13:30:47
I am able to extract values from elements (using lxml in Python 2.7) when one namespace is used. However, I can't figure out how to extract values when a second namespace is used. I want to extract the value within //cc-cpl:MainClosedCaption/Id, but I keep getting lxml.etree.XPathEvalError: Invalid expression errors. To be specific, the value I'm trying to extract from my sample XML is urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d. Here's a snippet of the XML (from a Digital Cinema Package):

    <?xml version="1.0" encoding="UTF-8"?>
    <CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL
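A general pattern for this kind of query (shown here with made-up element names and namespace URIs, since the sample document is cut off above) is to register a prefix for every namespace involved and use those prefixes consistently in the XPath expression:

    from lxml import etree

    # Hypothetical two-namespace document standing in for the truncated CPL snippet
    xml = '''<CompositionPlaylist xmlns="http://example.com/cpl"
                                  xmlns:cc-cpl="http://example.com/cc-cpl">
      <cc-cpl:MainClosedCaption>
        <cc-cpl:Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</cc-cpl:Id>
      </cc-cpl:MainClosedCaption>
    </CompositionPlaylist>'''

    root = etree.fromstring(xml)
    ns = {'cpl': 'http://example.com/cpl', 'cc': 'http://example.com/cc-cpl'}

    # The prefixes in the expression only need to match the URIs in the dict,
    # not the prefixes used in the document itself
    print(root.xpath('//cc:MainClosedCaption/cc:Id/text()', namespaces=ns)[0])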

How to read an html table with multiple tbodies with python pandas' read_html?

一曲冷凌霜 submitted on 2019-12-06 13:14:11
This is my HTML:

    import pandas as pd

    html_table = '''<table>
    <thead>
    <tr><th>Col1</th><th>Col2</th>
    </thead>
    <tbody>
    <tr><td>1a</td><td>2a</td></tr>
    </tbody>
    <tbody>
    <tr><td>1b</td><td>2b</td></tr>
    </tbody>
    </table>'''

If I run df = pd.read_html(html_table) and then print(df[0]), I get:

      Col1 Col2
    0   1a   2a

The second row (1b, 2b) disappears. Why? How can I prevent it?

The HTML you have posted is not valid. Multiple tbody elements are what confuses the pandas parser logic. If you cannot fix the input HTML itself, you have to pre-parse it and "unwrap" all the tbody elements:

    import pandas as pd
    from bs4 import
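The cut-off answer above is heading toward unwrapping the extra tbody tags before handing the markup to pandas. A rough sketch of that idea, assuming BeautifulSoup 4 (the exact code in the original answer may differ):

    import pandas as pd
    from bs4 import BeautifulSoup

    html_table = '''<table>
    <thead><tr><th>Col1</th><th>Col2</th></tr></thead>
    <tbody><tr><td>1a</td><td>2a</td></tr></tbody>
    <tbody><tr><td>1b</td><td>2b</td></tr></tbody>
    </table>'''

    soup = BeautifulSoup(html_table, "html.parser")
    for tbody in soup.find_all("tbody"):
        tbody.unwrap()   # keep the rows, drop the surrounding <tbody>

    df = pd.read_html(str(soup))[0]
    print(df)   # both rows should now be present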

Parse paragraphs from HTML using lxml

随声附和 submitted on 2019-12-06 13:02:45
I am new to lxml and want to extract <p>PARAGRAPHS</p> and <li>PARAGRAPHS</li> from a given URL and use them for further steps. I followed an example from a post and tried the following code with no luck:

    html = lxml.html('http://www.google.com/intl/en/about/corporate/index.html')
    url = 'http://www.google.com/intl/en/about/corporate/index.html'
    print html.parse.xpath('//p/text()')

I tried to look into the examples in lxml.html, but didn't find any example using a URL. Could you give me any hint on which methods I should use? Thanks.

    import lxml.html
    htmltree = lxml.html.parse('http://www
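The truncated answer is pointing at lxml.html.parse(), which can fetch and parse a URL directly. A minimal sketch along those lines (the URL is the one from the question and may no longer resolve):

    import lxml.html

    url = 'http://www.google.com/intl/en/about/corporate/index.html'
    htmltree = lxml.html.parse(url)              # parse() accepts a URL or file-like object

    paragraphs = htmltree.xpath('//p/text()')    # text of every <p>
    list_items = htmltree.xpath('//li/text()')   # text of every <li>
    print(paragraphs)
    print(list_items)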

Write xml with a path and value

自闭症网瘾萝莉.ら submitted on 2019-12-06 13:01:22
Question: I have a list of paths and values, something like this:

    [
        {'Path': 'Item/Info/Name', 'Value': 'Body HD'},
        {'Path': 'Item/Genres/Genre', 'Value': 'Action'},
    ]

And I want to build out the full XML structure, which would be:

    <Item>
      <Info>
        <Name>Body HD</Name>
      </Info>
      <Genres>
        <Genre>Action</Genre>
      </Genres>
    </Item>

Is there a way to do this with lxml? Or how could I build a function to fill in the inferred paths?

Answer 1: You could do something like:

    l = [
        {'Path': 'Item/Info/Name', 'Value': 'Body
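The answer above is cut off; one way to approach it, sketched here with lxml.etree (the helper logic is my own and not necessarily what the original answer did):

    from lxml import etree

    paths = [
        {'Path': 'Item/Info/Name', 'Value': 'Body HD'},
        {'Path': 'Item/Genres/Genre', 'Value': 'Action'},
    ]

    root = None
    for entry in paths:
        parts = entry['Path'].split('/')
        if root is None:
            root = etree.Element(parts[0])   # all paths share the same root tag here
        node = root
        for tag in parts[1:]:
            child = node.find(tag)           # reuse the element if this path segment already exists
            if child is None:
                child = etree.SubElement(node, tag)
            node = child
        node.text = entry['Value']

    print(etree.tostring(root, pretty_print=True).decode())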

XPath: a parsing language for crawlers

我的梦境 submitted on 2019-12-06 10:32:21
XPath

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works just as well for HTML, so it can be used for information extraction when writing crawlers.

1. XPath overview

XPath's selection capabilities are very powerful. It provides concise, clear path-selection expressions, plus more than 100 built-in functions for matching strings, numbers and times and for handling nodes and sequences. Almost any node you might want to locate can be selected with XPath.

Official documentation: https://www.w3.org/TR/xpath/

2. Common XPath rules

Expression  Description
nodename    selects all child nodes of this node
/           selects direct children from the current node
//          selects descendants from the current node
.           selects the current node
..          selects the parent of the current node
@           selects attributes

These are the commonly used XPath matching rules. For example:

    //title[@lang='eng']

This XPath rule selects every node named title whose lang attribute equals eng. Later on, Python's lxml library will be used to parse HTML with XPath.

3. Installation

On Windows with Python 3: pip install lxml

4.
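As a quick, self-contained illustration of that rule (the document below is made up, not taken from the original article):

    from lxml import etree

    text = '''<library>
      <title lang="eng">Learning XPath</title>
      <title lang="fra">Apprendre XPath</title>
    </library>'''

    root = etree.fromstring(text)
    # //title[@lang='eng'] selects every title element whose lang attribute is "eng"
    print(root.xpath("//title[@lang='eng']/text()"))   # ['Learning XPath']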

Python crawlers -- data parsing

一个人想着一个人 submitted on 2019-12-06 10:27:58
Data parsing

What data parsing is and what it is for
Concept: extracting a local portion of data from a larger data set.
Purpose: to implement focused crawlers.

General principle of data parsing
- Locate the tag
- Take its text or attributes

Regex parsing

Regex review

Single characters:
. : any character except a newline
[] : e.g. [aoe], [a-w] -- match any one character from the set
\d : digit, [0-9]
\D : non-digit
\w : digit, letter, underscore, or Chinese character
\W : not \w
\s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
\S : non-whitespace

Quantifiers:
* : any number of times, >= 0
+ : at least once, >= 1
? : optional, 0 or 1 time
{m} : exactly m times
{m,} : at least m times, e.g. hello{3,}
{m,n} : m to n times

Boundaries:
$ : ends with ...
^ : starts with ...

Grouping: (ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?

re.I : ignore case
re.M : multi-line matching
re.S : single-line mode (dot also matches newlines)
re.sub(pattern, replacement, string)

Regex practice

    import re

    # extract "python"
    key = "javapythonc++php"
    res = re.findall('python', key)[0]   # re.findall('python', key) returns a list
    print(res)

    # extract "hello world"
    key = "<html>
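The practice snippet breaks off mid-example. A hypothetical completion of the "hello world" exercise (the input string is made up, since the original is truncated):

    import re

    # extract "hello world" -- hypothetical input standing in for the truncated example
    key = "<html><h1>hello world</h1></html>"
    # the group (.*) captures whatever sits between the <h1> tags
    print(re.findall('<h1>(.*)</h1>', key)[0])   # hello world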

Python lxml - using the xml:lang attribute to retrieve an element

偶尔善良 submitted on 2019-12-06 10:00:34
I have some XML which has multiple elements with the same name, but each is in a different language, for example:

    <Title xml:lang="FR" type="main">Les Tudors</Title>
    <Title xml:lang="DE" type="main">Die Tudors</Title>
    <Title xml:lang="IT" type="main">The Tudors</Title>

Normally, I'd retrieve an element using its attributes as follows:

    titlex = info.find('.//xmlns:Title[@someattribute=attributevalue]', namespaces=nsmap)

If I try to do this with [@xml:lang="FR"] (for example), I get the traceback error:

    File "D:/Python code/RBM CRID, Title, Genre/CRID, Title, Genre, Age rating, Episode Number,
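One approach that should work is to switch from find() to xpath(): in XPath the xml prefix is predefined (bound to http://www.w3.org/XML/1998/namespace), so it can appear in the predicate directly. A sketch with a made-up namespace URI standing in for the document's real one:

    from lxml import etree

    xml = '''<Programme xmlns="http://example.com/metadata">
      <Title xml:lang="FR" type="main">Les Tudors</Title>
      <Title xml:lang="DE" type="main">Die Tudors</Title>
    </Programme>'''

    info = etree.fromstring(xml)
    nsmap = {'ns': 'http://example.com/metadata'}   # hypothetical URI

    # xml: does not need to be registered in nsmap; it is always available in XPath
    title_fr = info.xpath('.//ns:Title[@xml:lang="FR"]', namespaces=nsmap)[0]
    print(title_fr.text)   # Les Tudors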

Python 3.4 : How to do xml validation

别来无恙 submitted on 2019-12-06 09:18:06
I'm trying to do XML validation against an XSD in Python. I was successful using the lxml package, but the problem started when I tried to port my code to Python 3.4. I tried to install lxml for version 3.4, and it looks like my enterprise Linux doesn't play very well with lxml.

pip installation:

    pip install lxml
    Collecting lxml
      Downloading lxml-3.4.4.tar.gz (3.5MB)
        100% |################################| 3.5MB 92kB/s
    Installing collected packages: lxml
      Running setup.py install for lxml
    Successfully installed lxml-3.4.4

After the pip installation:

    > python
    Python 3.4.1 (default, Nov 12 2014, 13:34:29)
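Setting the install trouble aside, the validation step itself is short with lxml once it imports cleanly; a minimal sketch (file names are placeholders):

    from lxml import etree

    # Placeholder paths -- substitute the real XSD and document
    schema = etree.XMLSchema(etree.parse("schema.xsd"))
    doc = etree.parse("document.xml")

    if schema.validate(doc):
        print("document is valid")
    else:
        print(schema.error_log)   # line numbers and messages for each violation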

lxml build on Solaris 10

元气小坏坏 submitted on 2019-12-06 09:17:11
Can you please help and advise with a problem with the Python 2.6.6 lxml build on Solaris 10?

Installation instructions: www.sunfreeware.com/download.html
Direct link to the file: http://www.sunfreeware.com/ftp/pub/freeware/sparc/10/lxml-2.2.8-sol10-sparc-local.gz

    [rainier]/usr/apps/openet/bmsystest/relAuto/RAP_SW> python
    Python 2.6.6 (r266:84292, Oct 12 2010, 15:25:47) [C] on sunos5
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml
    >>> from lxml import etree
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: ld.so.1: python: