lxml

BeautifulSoup won't recognize lxml

人走茶凉 submitted on 2019-11-29 03:43:21
I'm attempting to use lxml as the parser for BeautifulSoup because the default one is MUCH slower, however I'm getting this error:

soup = BeautifulSoup(html, "lxml")
  File "/home/rob/python/stock/local/lib/python2.7/site-packages/bs4/__init__.py", line 152, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I have uninstalled and reinstalled lxml as well as BeautifulSoup many times, but it still will not find it. I've tried reinstalling lxml's dependencies as well, and I'm still …
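A quick sanity check, in case it helps: a common cause of this error is that lxml was installed into a different Python environment than the one bs4 runs under (system Python vs. the virtualenv at /home/rob/python/stock). A minimal sketch to verify that both packages are visible to the same interpreter:

import lxml  # raises ImportError if lxml is missing from this interpreter
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.text)  # prints "hello" once bs4 can find the lxml tree builder

If the bare import fails, installing lxml with that environment's own pip is the usual fix.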

Preserving the original doctype and declaration of an lxml.etree-parsed XML

二次信任 submitted on 2019-11-29 03:32:40
I'm using Python's lxml and I'm trying to read an XML document, modify it, and write it back, but the original doctype and XML declaration disappear. I'm wondering if there's an easy way of putting them back in, whether through lxml or some other solution?

John Keyes: tl;dr

# adds a declaration with version and encoding regardless of
# which attributes were present in the original declaration
# expects utf-8 encoding (encode/decode calls)
# depending on your needs you might want to improve that
from lxml import etree
from xml.dom.minidom import parseString

xml1 = '''\
<?xml version="1.0" encoding="UTF-8 …
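For reference, a minimal lxml-only sketch (the sample document here is illustrative): if you parse with etree.parse() and serialize the resulting ElementTree rather than the root Element, lxml keeps the DOCTYPE, and xml_declaration=True re-emits the declaration:

from io import BytesIO
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "local.dtd">
<root><a>text</a></root>'''

tree = etree.parse(BytesIO(xml))  # an ElementTree, with docinfo attached
# Serializing the ElementTree (not the root Element) keeps the DOCTYPE.
print(etree.tostring(tree, xml_declaration=True,
                     encoding=tree.docinfo.encoding).decode('utf-8'))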

Web scraping with BeautifulSoup or lxml.html

眉间皱痕 submitted on 2019-11-29 02:46:52
I have seen some webcasts and need help trying to do this: I have been using lxml.html. Yahoo recently changed the web structure.

Target page: http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true

In Chrome, using the inspector, I see the data in //*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table, then some more code.

How do I get this data out into a list? I want to change to another stock, from "LLY" to "Msft". How do I switch between dates and get all the months?

I know you said you can't use lxml.html. But here is how to do it using that library, because …
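A minimal lxml.html sketch for pulling the table rows out by XPath, reusing the URL and XPath quoted in the question. Two caveats: plugging the ticker into the URL this way is an assumption about how the page is parameterized, and modern Yahoo Finance pages are rendered with JavaScript, so the table may not exist in the raw HTML at all:

import lxml.html

symbol = "IBM"  # swap in "LLY", "MSFT", etc.
url = ("http://finance.yahoo.com/quote/%s/options"
       "?date=1469750400&straddle=true" % symbol)
doc = lxml.html.parse(url).getroot()
rows = doc.xpath('//*[@id="main-0-Quote-Proxy"]'
                 '/section/section/div[2]/section/section/table//tr')
data = [[cell.text_content().strip() for cell in row] for row in rows]
print(data)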

XPath vs DOM vs BeautifulSoup vs lxml vs others: which is the fastest approach to parsing a webpage?

≡放荡痞女 submitted on 2019-11-29 02:24:43
I know how to parse a page using Python. My question is: which is the fastest of all the parsing techniques, and by how much? The techniques I know are XPath, DOM, BeautifulSoup, and Python's find method.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

lxml is written in C, and if you are on x86 it is the best choice. As far as techniques go, there is no big difference between XPath and DOM; both are very fast. But if you use find or findAll in BeautifulSoup, it will be slower than the others. BeautifulSoup is written in Python. This lib …
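A rough timing sketch (not a rigorous benchmark; the 1000-paragraph document is made up) comparing lxml's XPath against BeautifulSoup's find_all on the same input:

import timeit

html = "<html><body>" + "<p class='x'>hi</p>" * 1000 + "</body></html>"

def with_lxml():
    import lxml.html
    doc = lxml.html.fromstring(html)
    return doc.xpath("//p[@class='x']")

def with_bs4():
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("p", class_="x")

print("lxml xpath:  ", timeit.timeit(with_lxml, number=100))
print("bs4 find_all:", timeit.timeit(with_bs4, number=100))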

Error with the parse function in lxml

半城伤御伤魂 submitted on 2019-11-29 01:45:27
I have installed lxml 2.2.2 on the Windows platform (I'm using Python version 2.6.5). I tried this simple command:

from lxml.html import parse
p = parse('http://www.google.com').getroot()

but I am getting the following error:

Traceback (most recent call last):
  File "", line 1, in
    p = parse('http://www.google.com').getroot()
  File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
  File "parser.pxi", line 1491, in …
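One common workaround, assuming the failure happens inside libxml2's own HTTP fetch (which, unlike Python's urllib2, does not pick up Windows proxy settings): download the page with urllib2 first and hand lxml the string:

import urllib2  # Python 2, matching the question's setup
from lxml.html import fromstring

html = urllib2.urlopen('http://www.google.com').read()
root = fromstring(html)
print(root.tag)  # 'html'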

A small question about cssselect in Python's lxml module

萝らか妹 submitted on 2019-11-29 01:44:31
Today, while parsing a page with lxml, I ran into a problem selecting an element whose class attribute contains a space, something like:

<div class="aa bb"></div>

cssselect('.aa bb') can't select it: with the space it is a descendant selector, matching <bb> tags inside an element of class aa. The solution is to replace the space with a dot, i.e. use new.cssselect('.aa.bb'), which matches an element carrying both classes.

Source: https://www.cnblogs.com/lxjhua/p/11438180.html
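A quick demonstration of the difference (note that recent lxml versions need the separate cssselect package installed for Element.cssselect to work):

import lxml.html

doc = lxml.html.fromstring(
    '<html><body><div class="aa bb">both classes</div></body></html>')
print(doc.cssselect('.aa bb'))   # [] -- descendant selector: <bb> tags inside .aa
print(doc.cssselect('.aa.bb'))   # [<Element div ...>] -- one element, both classes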

lxml parser eats all memory

女生的网名这么多〃 submitted on 2019-11-29 01:42:45
I'm writing a spider in Python, using the lxml library for parsing HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8 GB (all the server's memory). But I have only 100 async threads, and each of them parses documents of at most 300 KB. I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it. The problem is in this line of code:

HTML = lxml.html.fromstring(htmltext)

Maybe someone knows what it can be, or how to fix this? Thanks for the help.

P.S. Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC …
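One mitigation sketch worth trying: holding a reference to any Element keeps its entire tree alive in libxml2's memory, so if parsed elements are stored somewhere (or captured by long-lived greenlets), whole 300 KB trees pile up. Copy the data out as plain Python strings and let the tree go; the function below is illustrative, not from the question:

import lxml.html

def extract_links(htmltext):
    root = lxml.html.fromstring(htmltext)
    # Plain strings hold no reference to the tree, so the whole
    # document becomes collectable as soon as this function returns.
    return [a.get('href') for a in root.xpath('//a[@href]')]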

Parsing HTML with lxml

前提是你 submitted on 2019-11-29 01:10:52
Question: I need help parsing some text out of a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml, but the docs are a little confusing and I was hoping someone here could help me. Here is the page I am trying to parse; I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on the site to parse, and each page's HTML is not always exactly the same (it might contain some …
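Since the actual page isn't shown, here is a hedged sketch of the general pattern: lxml.html tolerates badly broken markup, and you can anchor on the heading text and walk its following siblings. The heading tag and the sample snippet are assumptions:

import lxml.html

broken_html = '''
<h2>Additional Info</h2>
<p>Some text I want
<p>More text, unclosed tags and all
'''

root = lxml.html.fromstring(broken_html)
for heading in root.xpath('//h2[contains(text(), "Additional Info")]'):
    for sibling in heading.itersiblings():
        print(sibling.text_content().strip())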

python - crawler basics - lxml.etree (3) - the ElementTree class

不问归期 submitted on 2019-11-29 00:53:06
'''
An ElementTree is mainly a document wrapper around a tree that has a root node.
It provides a couple of methods for serialization and general document handling.
'''
from lxml import etree

root = etree.XML('''\
<?xml version="1.0"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
<root>
  <a>&tasty;</a>
</root>
''')

tree = etree.ElementTree(root)
print(tree.docinfo.xml_version)
print(tree.docinfo.doctype)

tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'
tree.docinfo.system_url = 'file://local.dtd'
print(tree.docinfo.doctype)

'''
When you call the parse() function to parse a file or file-like object
(see the parsing section below), you also get an ElementTree. One important
difference is that the ElementTree class serializes as a complete document,
rather than as a single Element. This includes top-level processing
instructions and comments, as well as the DOCTYPE and other DTD content
in the document:
'''
print …

How to match a text node then follow parent nodes using XPath

感情迁移 submitted on 2019-11-29 00:40:25
Question: I'm trying to parse some HTML with XPath. Following the simplified XML example below, I want to match the string 'Text 1', then grab the contents of the relevant content node.

<doc>
  <block>
    <title>Text 1</title>
    <content>Stuff I want</content>
  </block>
  <block>
    <title>Text 2</title>
    <content>Stuff I don't want</content>
  </block>
</doc>

My Python code throws a wobbly:

>>> from lxml import etree
>>>
>>> tree = etree.XML("<doc><block><title>Text 1</title><content>Stuff I want</content></block> …
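For what it's worth, a minimal sketch of the XPath being asked for: match the <title> by its text, step up to the parent <block> with .., and read the sibling <content>:

from lxml import etree

tree = etree.XML(
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>Stuff I don't want</content></block>"
    "</doc>"
)
print(tree.xpath("//title[text()='Text 1']/../content/text()"))
# ['Stuff I want']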