lxml

lxml: cssselect(): AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'

旧时模样 submitted on 2019-12-06 13:41:44
Can someone explain why the first call to root.cssselect() works, while the second fails?

    from lxml.html import fromstring
    from lxml import etree

    html = '<html><a href="http://example.com">example</a></html'
    root = fromstring(html)
    print 'via fromstring', repr(root)   # via fromstring <Element html at 0x...>
    print root.cssselect("a")

    root2 = etree.HTML(html)
    print 'via etree.HTML()', repr(root2)   # via etree.HTML() <Element html at 0x...>
    root2.cssselect("a")   # --> Exception

I get:

    Traceback (most recent call last):
      File "/home/foo_eins_d/src/foo.py", line 11, in <module>
        root2.cssselect("a")
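The difference is that lxml.html.fromstring() returns an HtmlElement, which has a cssselect() method, while etree.HTML() returns a plain _Element, which does not. A minimal sketch of one workaround, assuming the separate cssselect package is installed, is to compile the selector explicitly and apply it to the plain element:

    from lxml import etree
    from lxml.cssselect import CSSSelector   # needs the cssselect package with lxml >= 3.0

    html = '<html><a href="http://example.com">example</a></html>'
    root2 = etree.HTML(html)

    select_links = CSSSelector("a")   # compile the CSS expression once
    print(select_links(root2))        # list of matching <a> elements

Alternatively, parsing with lxml.html in the first place keeps cssselect() available on every element.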

Extract value from element when second namespace is used in lxml

一曲冷凌霜 submitted on 2019-12-06 13:30:47
I am able to extract values from elements (using lxml in Python 2.7) when one namespace is used. However, I can't figure out how to extract values when a second namespace is used. I want to extract the value within //cc-cpl:MainClosedCaption/Id, but I keep getting lxml.etree.XPathEvalError: Invalid expression errors. To be specific, the value I'm trying to extract from my sample XML is urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d. Here's a snippet of the XML (from a Digital Cinema Package):

    <?xml version="1.0" encoding="UTF-8"?>
    <CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL
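A general pattern for this kind of query (shown here with made-up element names and namespace URIs, since the sample document is cut off above) is to register a prefix for every namespace involved and use those prefixes consistently in the XPath expression:

    from lxml import etree

    # Hypothetical two-namespace document standing in for the truncated CPL snippet
    xml = '''<CompositionPlaylist xmlns="http://example.com/cpl"
                                  xmlns:cc-cpl="http://example.com/cc-cpl">
      <cc-cpl:MainClosedCaption>
        <cc-cpl:Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</cc-cpl:Id>
      </cc-cpl:MainClosedCaption>
    </CompositionPlaylist>'''

    root = etree.fromstring(xml)
    ns = {'cpl': 'http://example.com/cpl', 'cc': 'http://example.com/cc-cpl'}

    # The prefixes in the expression only need to match the URIs in the dict,
    # not the prefixes used in the document itself
    print(root.xpath('//cc:MainClosedCaption/cc:Id/text()', namespaces=ns)[0])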

How to read an html table with multiple tbodies with python pandas' read_html?

一曲冷凌霜 submitted on 2019-12-06 13:14:11
This is my HTML:

    import pandas as pd

    html_table = '''<table>
    <thead>
    <tr><th>Col1</th><th>Col2</th>
    </thead>
    <tbody>
    <tr><td>1a</td><td>2a</td></tr>
    </tbody>
    <tbody>
    <tr><td>1b</td><td>2b</td></tr>
    </tbody>
    </table>'''

If I run df = pd.read_html(html_table) and then print(df[0]), I get:

      Col1 Col2
    0   1a   2a

The second row (1b, 2b) disappears. Why? How can I prevent it?

The HTML you have posted is not valid. Multiple tbody elements are what confuses the pandas parser logic. If you cannot fix the input HTML itself, you have to pre-parse it and "unwrap" all the tbody elements:

    import pandas as pd
    from bs4 import
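The cut-off answer above is heading toward unwrapping the extra tbody tags before handing the markup to pandas. A rough sketch of that idea, assuming BeautifulSoup 4 (the exact code in the original answer may differ):

    import pandas as pd
    from bs4 import BeautifulSoup

    html_table = '''<table>
    <thead><tr><th>Col1</th><th>Col2</th></tr></thead>
    <tbody><tr><td>1a</td><td>2a</td></tr></tbody>
    <tbody><tr><td>1b</td><td>2b</td></tr></tbody>
    </table>'''

    soup = BeautifulSoup(html_table, "html.parser")
    for tbody in soup.find_all("tbody"):
        tbody.unwrap()   # keep the rows, drop the surrounding <tbody>

    df = pd.read_html(str(soup))[0]
    print(df)   # both rows should now be present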

Parse paragraphs from HTML using lxml

随声附和 submitted on 2019-12-06 13:02:45
I am new to lxml and want to extract <p>PARAGRAPHS</p> and <li>PARAGRAPHS</li> from a given URL and use them for further steps. I followed an example from a post and tried the following code with no luck:

    html = lxml.html('http://www.google.com/intl/en/about/corporate/index.html')
    url = 'http://www.google.com/intl/en/about/corporate/index.html'
    print html.parse.xpath('//p/text()')

I tried to look into the examples in lxml.html, but didn't find any example using a URL. Could you give me any hint on which methods I should use? Thanks.

    import lxml.html
    htmltree = lxml.html.parse('http://www
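The truncated answer is pointing at lxml.html.parse(), which can fetch and parse a URL directly. A minimal sketch along those lines (the URL is the one from the question and may no longer resolve):

    import lxml.html

    url = 'http://www.google.com/intl/en/about/corporate/index.html'
    htmltree = lxml.html.parse(url)              # parse() accepts a URL or file-like object

    paragraphs = htmltree.xpath('//p/text()')    # text of every <p>
    list_items = htmltree.xpath('//li/text()')   # text of every <li>
    print(paragraphs)
    print(list_items)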

Write xml with a path and value

自闭症网瘾萝莉.ら submitted on 2019-12-06 13:01:22
Question: I have a list of paths and values, something like this:

    [
        {'Path': 'Item/Info/Name', 'Value': 'Body HD'},
        {'Path': 'Item/Genres/Genre', 'Value': 'Action'},
    ]

And I want to build out the full XML structure, which would be:

    <Item>
      <Info>
        <Name>Body HD</Name>
      </Info>
      <Genres>
        <Genre>Action</Genre>
      </Genres>
    </Item>

Is there a way to do this with lxml? Or how could I build a function to fill in the inferred paths?

Answer 1: You could do something like:

    l = [
        {'Path': 'Item/Info/Name', 'Value': 'Body
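The answer above is cut off; one way to approach it, sketched here with lxml.etree (the helper logic is my own and not necessarily what the original answer did):

    from lxml import etree

    paths = [
        {'Path': 'Item/Info/Name', 'Value': 'Body HD'},
        {'Path': 'Item/Genres/Genre', 'Value': 'Action'},
    ]

    root = None
    for entry in paths:
        parts = entry['Path'].split('/')
        if root is None:
            root = etree.Element(parts[0])   # all paths share the same root tag here
        node = root
        for tag in parts[1:]:
            child = node.find(tag)           # reuse the element if this path segment already exists
            if child is None:
                child = etree.SubElement(node, tag)
            node = child
        node.text = entry['Value']

    print(etree.tostring(root, pretty_print=True).decode())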

XPath: a parsing language for crawlers

我的梦境 submitted on 2019-12-06 10:32:21
XPath

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works just as well for HTML, so it can be used for information extraction when writing crawlers.

1. XPath overview

XPath's selection capabilities are very powerful. It provides concise, clear path-selection expressions, plus more than 100 built-in functions for matching strings, numbers and times and for handling nodes and sequences. Almost any node you might want to locate can be selected with XPath.

Official documentation: https://www.w3.org/TR/xpath/

2. Common XPath rules

Expression  Description
nodename    selects all child nodes of this node
/           selects direct children from the current node
//          selects descendants from the current node
.           selects the current node
..          selects the parent of the current node
@           selects attributes

These are the commonly used XPath matching rules. For example:

    //title[@lang='eng']

This XPath rule selects every node named title whose lang attribute equals eng. Later on, Python's lxml library will be used to parse HTML with XPath.

3. Installation

On Windows with Python 3: pip install lxml

4.
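As a quick, self-contained illustration of that rule (the document below is made up, not taken from the original article):

    from lxml import etree

    text = '''<library>
      <title lang="eng">Learning XPath</title>
      <title lang="fra">Apprendre XPath</title>
    </library>'''

    root = etree.fromstring(text)
    # //title[@lang='eng'] selects every title element whose lang attribute is "eng"
    print(root.xpath("//title[@lang='eng']/text()"))   # ['Learning XPath']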

Python crawlers -- data parsing

一个人想着一个人 submitted on 2019-12-06 10:27:58
Data parsing

What data parsing is and what it is for
Concept: extracting a local portion of data from a larger data set.
Purpose: to implement focused crawlers.

General principle of data parsing
- Locate the tag
- Take its text or attributes

Regex parsing

Regex review

Single characters:
. : any character except a newline
[] : e.g. [aoe], [a-w] -- match any one character from the set
\d : digit, [0-9]
\D : non-digit
\w : digit, letter, underscore, or Chinese character
\W : not \w
\s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
\S : non-whitespace

Quantifiers:
* : any number of times, >= 0
+ : at least once, >= 1
? : optional, 0 or 1 time
{m} : exactly m times
{m,} : at least m times, e.g. hello{3,}
{m,n} : m to n times

Boundaries:
$ : ends with ...
^ : starts with ...

Grouping: (ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?

re.I : ignore case
re.M : multi-line matching
re.S : single-line mode (dot also matches newlines)
re.sub(pattern, replacement, string)

Regex practice

    import re

    # extract "python"
    key = "javapythonc++php"
    res = re.findall('python', key)[0]   # re.findall('python', key) returns a list
    print(res)

    # extract "hello world"
    key = "<html>
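The practice snippet breaks off mid-example. A hypothetical completion of the "hello world" exercise (the input string is made up, since the original is truncated):

    import re

    # extract "hello world" -- hypothetical input standing in for the truncated example
    key = "<html><h1>hello world</h1></html>"
    # the group (.*) captures whatever sits between the <h1> tags
    print(re.findall('<h1>(.*)</h1>', key)[0])   # hello world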

Python lxml - using the xml:lang attribute to retrieve an element

偶尔善良 submitted on 2019-12-06 10:00:34
I have some XML which has multiple elements with the same name, but each is in a different language, for example:

    <Title xml:lang="FR" type="main">Les Tudors</Title>
    <Title xml:lang="DE" type="main">Die Tudors</Title>
    <Title xml:lang="IT" type="main">The Tudors</Title>

Normally, I'd retrieve an element using its attributes as follows:

    titlex = info.find('.//xmlns:Title[@someattribute=attributevalue]', namespaces=nsmap)

If I try to do this with [@xml:lang="FR"] (for example), I get the traceback error:

    File "D:/Python code/RBM CRID, Title, Genre/CRID, Title, Genre, Age rating, Episode Number,
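One approach that should work is to switch from find() to xpath(): in XPath the xml prefix is predefined (bound to http://www.w3.org/XML/1998/namespace), so it can appear in the predicate directly. A sketch with a made-up namespace URI standing in for the document's real one:

    from lxml import etree

    xml = '''<Programme xmlns="http://example.com/metadata">
      <Title xml:lang="FR" type="main">Les Tudors</Title>
      <Title xml:lang="DE" type="main">Die Tudors</Title>
    </Programme>'''

    info = etree.fromstring(xml)
    nsmap = {'ns': 'http://example.com/metadata'}   # hypothetical URI

    # xml: does not need to be registered in nsmap; it is always available in XPath
    title_fr = info.xpath('.//ns:Title[@xml:lang="FR"]', namespaces=nsmap)[0]
    print(title_fr.text)   # Les Tudors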

Python 3.4 : How to do xml validation

别来无恙 submitted on 2019-12-06 09:18:06
I'm trying to do XML validation against an XSD in Python. I was successful using the lxml package, but the problem started when I tried to port my code to Python 3.4. I tried to install lxml for version 3.4, and it looks like my enterprise Linux doesn't play very well with lxml.

pip installation:

    pip install lxml
    Collecting lxml
      Downloading lxml-3.4.4.tar.gz (3.5MB)
        100% |################################| 3.5MB 92kB/s
    Installing collected packages: lxml
      Running setup.py install for lxml
    Successfully installed lxml-3.4.4

After the pip installation:

    > python
    Python 3.4.1 (default, Nov 12 2014, 13:34:29)
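Setting the install trouble aside, the validation step itself is short with lxml once it imports cleanly; a minimal sketch (file names are placeholders):

    from lxml import etree

    # Placeholder paths -- substitute the real XSD and document
    schema = etree.XMLSchema(etree.parse("schema.xsd"))
    doc = etree.parse("document.xml")

    if schema.validate(doc):
        print("document is valid")
    else:
        print(schema.error_log)   # line numbers and messages for each violation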

lxml build on Solaris 10

元气小坏坏 submitted on 2019-12-06 09:17:11
Can you please help and advise with a problem with the Python 2.6.6 lxml build on Solaris 10?

Installation instructions: www.sunfreeware.com/download.html
Direct link to the file: http://www.sunfreeware.com/ftp/pub/freeware/sparc/10/lxml-2.2.8-sol10-sparc-local.gz

    [rainier]/usr/apps/openet/bmsystest/relAuto/RAP_SW> python
    Python 2.6.6 (r266:84292, Oct 12 2010, 15:25:47) [C] on sunos5
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml
    >>> from lxml import etree
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: ld.so.1: python: