lxml

Using the Beautiful Soup Module

雨燕双飞 submitted on 2019-11-27 06:07:54
1. Introduction to the Beautiful Soup module

Beautiful Soup is a Python library for extracting data from HTML or XML files. Simply put, it parses an HTML document into a tree structure, making it easy to retrieve the attributes of a given tag, and it is also convenient for crawling and parsing the content of an entire site. Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers; if we don't install one of those, Python uses its default parser. lxml is a Python parsing library that supports both HTML and XML, while the html5lib parser parses the way a browser does and produces an HTML5 document.

    pip install beautifulsoup4
    pip install html5lib
    pip install lxml

2. Parsing an HTML document with the Beautiful Soup module

Suppose we have a fragment of incomplete HTML, and we now want to parse it with the Beautiful Soup module:

    data = '''
    <html><head><title>The Dormouse's story</title></he
    <body>
    <p class="title"><b id="title">The Dormouse's story</b></p>
    <p class="story">Once upon a time there
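A minimal sketch (not from the original post) of parsing such a broken fragment with Beautiful Soup, assuming the html5lib backend from the install step above; repairing incomplete markup is the point of the example:

    from bs4 import BeautifulSoup

    data = '''
    <html><head><title>The Dormouse's story</title>
    <body>
    <p class="title"><b id="title">The Dormouse's story</b></p>
    '''

    # html5lib repairs incomplete markup the way a browser would;
    # "lxml" or "html.parser" could be substituted here.
    soup = BeautifulSoup(data, 'html5lib')

    print(soup.title.string)        # The Dormouse's story
    print(soup.find('b')['id'])     # title
    print(soup.prettify())          # the repaired, completed document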

Is it possible for lxml to work in a case-insensitive manner?

不打扰是莪最后的温柔 submitted on 2019-11-27 06:03:52
Question: I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over these sites, so I have to take what I'm given. They use a variety of casings for the tag and attribute names, which means I need to work case-insensitively. I can't believe the lxml authors are so stubborn as to insist on fully forced standards compliance when it rules out so much of the use of their library. I'd like to be able to say doc.cssselect('meta[name=description]') (or some
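One hedged workaround (not from the original post): lxml.html already lowercases tag and attribute names when parsing HTML, so only the attribute value needs case-folding, which XPath 1.0's translate() can do:

    from lxml import html

    doc = ('<html><head>'
           '<META NAME="Description" CONTENT="hello"></head><body></body></html>')
    page = html.fromstring(doc)

    # lxml.html normalizes tag/attribute *names* to lowercase, but the
    # attribute *value* keeps its original casing, so fold it by hand.
    LOWER = ('translate(@name, "ABCDEFGHIJKLMNOPQRSTUVWXYZ",'
             ' "abcdefghijklmnopqrstuvwxyz")')
    metas = page.xpath('//meta[%s="description"]' % LOWER)

    print(metas[0].get('content'))  # hello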

BeautifulSoup and lxml.html - what to prefer? [duplicate]

╄→尐↘猪︶ㄣ submitted on 2019-11-27 05:37:20
Question: This question already has an answer here: Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes? (7 answers). I am working on a project that will involve parsing HTML. After searching around, I found two likely options: BeautifulSoup and lxml.html. Is there any reason to prefer one over the other? I used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common. I know I should use
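For a concrete feel of the difference, here are the two APIs doing the same extraction; an illustrative sketch, with a made-up snippet (note that BeautifulSoup can even use lxml as its backend):

    from bs4 import BeautifulSoup
    import lxml.html

    snippet = '<div><a href="/about">About</a></div>'

    # BeautifulSoup: forgiving, attribute-style navigation.
    soup = BeautifulSoup(snippet, 'lxml')
    print(soup.a['href'])                   # /about

    # lxml.html: faster, with XPath and CSS selectors built in.
    doc = lxml.html.fromstring(snippet)
    print(doc.xpath('//a/@href')[0])        # /about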

How do I use xml namespaces with find/findall in lxml?

眉间皱痕 submitted on 2019-11-27 05:36:11
Question: I'm trying to parse the content of an OpenOffice ODS spreadsheet. The ods format is essentially just a zip file containing a number of documents; the content of the spreadsheet is stored in 'content.xml'.

    import zipfile
    from lxml import etree

    zf = zipfile.ZipFile('spreadsheet.ods')
    root = etree.parse(zf.open('content.xml'))

The content of the spreadsheet is in a cell:

    table = root.find('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table')

We can also go straight for the rows:

    rows = root.findall(
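A common answer to this (a sketch, since the post above is truncated): find() and findall() in lxml accept a namespaces mapping, so the verbose Clark-notation braces can be replaced with a prefix:

    import zipfile
    from lxml import etree

    NSMAP = {
        'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0',
    }

    zf = zipfile.ZipFile('spreadsheet.ods')    # file name from the post
    root = etree.parse(zf.open('content.xml'))

    # The "table:" prefix is resolved through the namespaces argument.
    table = root.find('.//table:table', namespaces=NSMAP)
    rows = root.findall('.//table:table-row', namespaces=NSMAP)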

How to Pretty Print HTML to a file, with indentation

随声附和 submitted on 2019-11-27 05:06:17
Question: I am using lxml.html to generate some HTML. I want to pretty print (with indentation) my final result into an HTML file. How do I do that? This is what I have tried so far (I am relatively new to Python and lxml):

    import lxml.html as lh
    from lxml.html import builder as E

    sliderRoot = lh.Element("div", E.CLASS("scroll"),
                            style="overflow-x: hidden; overflow-y: hidden;")
    scrollContainer = lh.Element("div", E.CLASS("scrollContainer"),
                                 style="width: 4340px;")
    sliderRoot.append
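A minimal sketch of one way to do this, reusing the element names from the post; lxml's tostring() takes a pretty_print flag, and the bytes it returns can be written straight to a file:

    import lxml.html as lh
    from lxml.html import builder as E

    sliderRoot = lh.Element("div", E.CLASS("scroll"))
    scrollContainer = lh.Element("div", E.CLASS("scrollContainer"))
    sliderRoot.append(scrollContainer)

    # pretty_print only indents where lxml is free to add whitespace,
    # i.e. between elements that carry no text of their own.
    html_bytes = lh.tostring(sliderRoot, pretty_print=True)
    with open('output.html', 'wb') as f:
        f.write(html_bytes)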

XPath Syntax and the lxml Module

▼魔方 西西 submitted on 2019-11-27 04:41:04
What is XPath? XPath (XML Path Language) is a language for finding information in XML and HTML documents; it can be used to traverse the elements and attributes of XML and HTML.

XPath development tools: the Chrome extension XPath Helper and the Firefox extension XPath Checker.

XPath syntax

Selecting nodes: XPath uses path expressions to select nodes or node sets in an XML document. These path expressions look very much like the ones we see in an ordinary computer file system.

    Expression   Description                                         Example          Result
    nodename     Selects all child nodes of the named node           bookstore        Selects all child nodes under bookstore
    /            From the root if at the start of the path,          /bookstore       Selects all bookstore nodes under the root
                 otherwise a child of the current node
    //           Selects matching nodes anywhere in the document     //book           Selects all book nodes anywhere in the document
    @            Selects an attribute                                //book[@price]   Selects all book nodes that have a price attribute

Predicates

Predicates are used to find a specific node, or nodes that contain a specified value. A predicate is enclosed in square brackets. The table below lists some path expressions with predicates, together with their results:

    Path expression           Result
    /bookstore/book[1]        Selects the first book element that is a child of bookstore.
    /bookstore/book[last()]
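A short sketch (not from the original post) showing these expressions evaluated with lxml over a made-up bookstore document:

    from lxml import etree

    doc = etree.fromstring('''
    <bookstore>
      <book price="29.99"><title>A</title></book>
      <book><title>B</title></book>
    </bookstore>
    ''')

    print(doc.xpath('/bookstore/book[1]/title/text()'))       # ['A']  first book
    print(doc.xpath('//book[@price]/@price'))                 # ['29.99']  books with a price
    print(doc.xpath('/bookstore/book[last()]/title/text()'))  # ['B']  last book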

Get second element text with XPath?

谁都会走 submitted on 2019-11-27 03:47:33
Question:

    <span class='python'>
      <a>google</a>
      <a>chrome</a>
    </span>

I want to get "chrome", and I already have it working like this:

    q = item.findall('.//span[@class="python"]//a')
    t = q[1].text  # first element = 0

I'd like to combine it into a single XPath expression and get just one item instead of a list. I tried this, but it doesn't work:

    t = item.findtext('.//span[@class="python"]//a[2]')  # first element = 1

And the actual, non-simplified HTML is like this:

    <span class='python'>
      <span>
        <span>
          <img><
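One common approach (a sketch, not necessarily the accepted answer): findtext() uses ElementPath, which supports only a subset of XPath, but the full xpath() method allows parenthesised grouping, so the [2] applies to the whole result set rather than to each parent's children:

    from lxml import html

    item = html.fromstring("""
    <div><span class='python'>
      <a>google</a>
      <a>chrome</a>
    </span></div>
    """)

    # .//a[2] means "an <a> that is the second <a> child of its parent";
    # (...)[2] means "the second node of the entire result set".
    t = item.xpath('(.//span[@class="python"]//a)[2]/text()')
    print(t[0])   # chrome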

Pretty print in lxml is failing when I add tags to a parsed tree

隐身守侯 submitted on 2019-11-27 03:45:19
Question: I have an XML file that I'm working with using etree from lxml, but when I add tags to it, pretty printing no longer seems to work.

    >>> from lxml import etree
    >>> root = etree.parse('file.xml').getroot()
    >>> print etree.tostring(root, pretty_print = True)
    <root>
      <x>
        <y>test1</y>
      </x>
    </root>

So far so good. But now:

    >>> x = root.find('x')
    >>> z = etree.SubElement(x, 'z')
    >>> etree.SubElement(z, 'z1').attrib['value'] = 'val1'
    >>> print etree.tostring(root, pretty_print = True)
    <root>
      <x>
        <y>test1<
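The usual explanation (hedged, since the post above is truncated): pretty_print refuses to touch existing whitespace, so elements appended under a node that already carries indentation text are not re-indented. Parsing with remove_blank_text strips that "ignorable" whitespace up front, leaving lxml free to indent the whole tree; a runnable sketch with inline XML standing in for file.xml:

    from lxml import etree

    xml = b"<root>\n  <x>\n    <y>test1</y>\n  </x>\n</root>"

    # remove_blank_text drops whitespace-only text nodes at parse time,
    # so pretty_print can re-indent everything, including new elements.
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(xml, parser)

    x = root.find('x')
    z = etree.SubElement(x, 'z')
    etree.SubElement(z, 'z1').set('value', 'val1')

    print(etree.tostring(root, pretty_print=True).decode())
    # <root>
    #   <x>
    #     <y>test1</y>
    #     <z>
    #       <z1 value="val1"/>
    #     </z>
    #   </x>
    # </root>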

How to use regular expression in lxml xpath?

ぐ巨炮叔叔 submitted on 2019-11-27 03:31:57
I'm using a construct like this:

    doc = parse(url).getroot()
    links = doc.xpath("//a[text()='some text']")

But I need to select all links whose text begins with "some text", so I'm wondering whether there is any way to use a regexp here. I didn't find anything in the lxml documentation.

Answer: You can do this (although you don't need regular expressions for this example). lxml supports regular expressions from the EXSLT extension functions (see the lxml docs for the XPath class, but it also works for the xpath() method):

    doc.xpath("//a[re:match(text(), 'some text')]", namespaces={"re": "http://exslt.org
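A runnable sketch of both approaches; the EXSLT namespace URI is completed here from the lxml documentation, since the excerpt above is cut off. Plain starts-with() covers the stated prefix-match need without regular expressions, while re:test() gives full regex matching:

    from lxml import html

    doc = html.fromstring('<p><a>some text here</a><a>other</a></p>')

    # Plain XPath 1.0: no regex needed for a prefix match.
    links = doc.xpath("//a[starts-with(text(), 'some text')]")

    # EXSLT regular expressions, enabled via the namespace mapping.
    NS = {'re': 'http://exslt.org/regular-expressions'}
    links_re = doc.xpath("//a[re:test(text(), '^some text')]", namespaces=NS)

    print([a.text for a in links])     # ['some text here']
    print([a.text for a in links_re])  # ['some text here']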

Can't install lxml on Ubuntu 12.04

心不动则不痛 submitted on 2019-11-27 03:28:07
Question: I've been trying to install lxml using pip install lxml and I get the error below. I've already run apt-get install python-dev libxml2 libxml2-dev libxslt-dev (as suggested in other answers), but I still get the same error. I did not press Ctrl-C.

    pip install lxml
    Downloading/unpacking lxml
      Downloading lxml-3.2.4.tar.gz (3.3MB): 3.3MB downloaded
      Running setup.py egg_info for package lxml
        /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
          warnings
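The excerpt ends before the actual compiler error, so any fix is a guess; the most common cause on Ubuntu 12.04 is missing development headers (note the package name libxslt1-dev, with the 1, rather than libxslt-dev), and builds on low-memory machines can also fail when the compiler is killed by the OOM killer. A typical first attempt:

    # Headers lxml compiles against (package names as of Ubuntu 12.04).
    sudo apt-get install python-dev libxml2-dev libxslt1-dev zlib1g-dev

    # Then retry the build.
    pip install lxml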