lxml

lxml xpath doesn't ignore “&nbsp;”

五迷三道 submitted on 2019-11-30 18:35:44
Question: I have this HTML:

    <td class="0"> <b>Bold Text</b>&nbsp;<a href=""></a> </td>
    <td class="0"> Regular Text&nbsp;<a href=""></a> </td>

When I query it with XPath:

    new_html = tree.xpath('//td[@class="0"]/text() | //td[@class="0"]/b/text()')

it prints:

    ['Bold Text', '', 'Regular Text']

As you can see, the &nbsp; character hasn't been ignored and is actually read as an extra entry in td. How can I get a better output?

Answer 1: Instead, I'd iterate over all the desired td elements and get the .text_content():
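The answer's code is cut off in this excerpt. A minimal sketch of the approach it names might look like the following, assuming the two cells sit inside a table row; the wrapper markup and the helper name `cell_texts` are illustrative and not from the original post.

```python
from lxml import html

# Hypothetical wrapper markup around the two cells from the question.
SNIPPET = (
    '<table><tr>'
    '<td class="0"> <b>Bold Text</b>&nbsp;<a href=""></a> </td>'
    '<td class="0"> Regular Text&nbsp;<a href=""></a> </td>'
    '</tr></table>'
)


def cell_texts(tree):
    """Return the visible text of each td with class "0", one string per cell."""
    texts = []
    for td in tree.xpath('//td[@class="0"]'):
        # text_content() concatenates all text inside the cell, including <b> text.
        # str.split() treats the non-breaking space (U+00A0) as whitespace, so
        # split/join drops it along with ordinary leading and trailing spaces.
        texts.append(" ".join(td.text_content().split()))
    return texts


tree = html.fromstring(SNIPPET)
print(cell_texts(tree))
```

Working per cell rather than per text node means the stray whitespace nodes never become list entries of their own.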

Commonly used libraries for web crawlers

那年仲夏 submitted on 2019-11-30 18:12:35
1. requests

  1. Fetch a single web page

    requests.get(url)

  Request parameters such as headers can also be set.

  2. Other request types

    requests.post(url, data={})
    requests.delete(url)
    requests.head(url)
    requests.options(url)

  3. Get a page's cookies

    response = requests.get(url)
    response.cookies
    for key, value in response.cookies.items():
        print(key + "=" + value)

  The cookies obtained this way can of course be passed back into a request, for example requests.get(url, cookies=cookies). A cookie jar is also commonly used:

    jar = requests.cookies.RequestsCookieJar()
    jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
    jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
    response = requests.get(url, cookies=jar)
    print(response.text)
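To see what the cookie jar actually sends, one option is to prepare a request without sending it and inspect the resulting header. This is a sketch under the assumption that the target URL is https://httpbin.org/cookies, which the notes above do not state; only the jar contents are taken from them.

```python
import requests

# Build the jar as in the notes above.
jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Assumed URL. prepare() builds the request without touching the network,
# so the Cookie header can be inspected offline.
prepared = requests.Request('GET', 'https://httpbin.org/cookies', cookies=jar).prepare()
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

Only cookies whose domain and path match the request URL are included, which is the point of setting `domain` and `path` on each entry.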

lxml.html. Error reading file; Failed to load external entity

风格不统一 submitted on 2019-11-30 17:50:24
Question: I am trying to get a movie trailer URL from YouTube by parsing with lxml.html:

    import urllib  # Python 2; in Python 3 urlencode lives in urllib.parse
    from lxml import html
    import lxml.html
    from lxml.etree import XPath

    def get_youtube_trailer(selected_movie):
        # Create the URL for the YouTube query in order to find the movie trailer
        title = selected_movie
        t = {'search_query' : title + ' movie trailer'}
        query_youtube = urllib.urlencode(t)
        search_url_youtube = 'https://www.youtube.com/results?' + query_youtube
        # Define the XPath for the YouTube movie trailer link

How to parse malformed HTML in python

萝らか妹 submitted on 2019-11-30 17:27:17
I need to browse the DOM tree of a parsed HTML document. I'm using uTidyLib before parsing the string with lxml:

    a = tidy.parseString(html_code, options)
    dom = etree.fromstring(str(a))

Sometimes I get an error; it seems that tidylib is not able to repair malformed HTML. How can I parse every HTML file without getting an error (parsing only some parts of files that cannot be repaired)?

Beautiful Soup does a good job with invalid/broken HTML:

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup("<htm@)($*><body><table <tr><td>hi</tr></td></body><html")
    >>> print soup.prettify()
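For comparison, lxml's own HTML parser also recovers from many kinds of broken markup, whereas etree.fromstring uses the strict XML parser. The sketch below swaps lxml.html in for uTidyLib, which is an assumption about what would be acceptable to the asker; the broken sample string and the helper name `parse_leniently` are made up for illustration.

```python
from lxml import etree, html

# Made-up malformed markup: unclosed <p> and <b> elements.
BROKEN = "<div><p>first paragraph<p>second <b>bold</div>"


def parse_leniently(markup):
    """Parse possibly malformed HTML with lxml's recovering HTML parser."""
    return html.fromstring(markup)


# The strict XML parser rejects the same input.
try:
    etree.fromstring(BROKEN)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

dom = parse_leniently(BROKEN)
text = dom.text_content()
print(xml_ok, repr(text))
```

How the parser repairs a given document is up to libxml2, so the resulting tree should be inspected rather than assumed.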

Parsing Source Code (Python) Approach: Beautiful Soup, lxml, html5lib difference?

六月ゝ 毕业季﹏ submitted on 2019-11-30 16:45:33
I have a large HTML source file I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching some parsers, and it seems Beautiful Soup, lxml, and html5lib are the most popular. From reading this website, it seems lxml is the most commonly used and fastest, while Beautiful Soup is slower but accounts for more errors and variation. I'm a little confused by the Beautiful Soup documentation, http://www.crummy.com/software/BeautifulSoup/bs4/doc/ , and commands like BeautifulSoup(markup, "lxml") or BeautifulSoup(markup, "html5lib"). In such

How to install lxml in Python 3.4 on Windows machine

馋奶兔 submitted on 2019-11-30 16:29:35
I've been spending hours on this. I'm new to Python and can't see what the solution may be. I have Python 3.4 and want to work with .docx, which requires lxml. What I've done so far: I went to the Python lxml package installer page, but it's quite confusing to know which version I need. I tried several of them that contained the number 34, both .exe and .tar. I also tried pip install lxml3.4.4 and pip install lxml 3.4.4. None of them worked either. This is what the command prompt says when I run pip install lxml (it automatically grabs the lxml 3.4.4 I've downloaded and then

Need help installing lxml on os x 10.7

ぃ、小莉子 submitted on 2019-11-30 15:20:11
I have been struggling to get from lxml import etree to work (import lxml works fine, by the way). The error is:

    ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/etree.so, 2): Symbol not found: _htmlParseChunk
      Referenced from: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/etree.so
      Expected in: flat namespace
     in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/etree.so

I used pip to install lxml, and homebrew to reinstall libxml2 with the right architecture (or so I

Unable to pass an lxml etree object to a separate process

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-30 14:51:56
I'm working on a project to parse multiple XML files concurrently in Python using lxml. When I initialize the process, I want my main class to do some work on the XML before it passes the etree object to the process, but I am finding that when the etree object arrives in the new process, the class survives but the XML is gone from within the object and getroot() returns None. I know that I can only pass picklable data using the queue, but is this also the case with what I pass to the process inside the 'args' field? Here's my code:

    import multiprocessing, multiprocessing.pool, time
    from lxml
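A common workaround, assuming the tree cannot be sent as-is (as the question reports), is to serialize it to bytes in the parent and re-parse it in the worker, since bytes pickle without trouble. The sketch below shows only the round trip; the helper names and the sample XML are hypothetical, and `pickle.dumps`/`pickle.loads` stands in for what a queue or a pool does to its arguments.

```python
import pickle

from lxml import etree


def tree_to_bytes(tree):
    """Serialize an lxml tree to bytes, which can be pickled safely."""
    return etree.tostring(tree)


def bytes_to_root(payload):
    """Rebuild the root element from serialized bytes inside the worker."""
    return etree.fromstring(payload)


# Parent side: parse and do some work on the XML first (made-up sample document).
tree = etree.ElementTree(etree.fromstring("<jobs><job id='1'>parse me</job></jobs>"))
tree.getroot().set("checked", "yes")

# Simulate the transfer between processes.
payload = pickle.loads(pickle.dumps(tree_to_bytes(tree)))

# Worker side: the changes made in the parent survive the round trip.
child_root = bytes_to_root(payload)
print(child_root.tag, child_root.get("checked"), child_root[0].text)
```

The cost is one extra serialize and parse per document, which is usually small next to the parsing work being distributed.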

Efficient way of XML parsing in ElementTree(1.3.0) Python

淺唱寂寞╮ submitted on 2019-11-30 14:47:10
I am trying to parse huge XML files, ranging from 20 MB to 3 GB. The files are samples coming from different instrumentation. What I am doing is finding the necessary element information in each file and inserting it into a database (Django). A small part of my file sample is shown below. Namespaces exist in all files. An interesting feature of the files is that they have more node attributes than text.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0
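For files of this size, one option is to stream the document with iterparse instead of loading it whole. The sketch below uses the standard-library ElementTree and the namespace from the sample; the element name `spectrum`, the sample data, and the helper name `iter_attributes` are assumptions for illustration, since the excerpt does not show which elements are needed.

```python
import io
import xml.etree.ElementTree as ET

MZML_NS = "http://psi.hupo.org/ms/mzml"

# Tiny made-up stand-in for a real file; element and attribute names are assumed.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<mzML xmlns="http://psi.hupo.org/ms/mzml">'
    b'<run id="r1">'
    b'<spectrum index="0" id="scan=1"/>'
    b'<spectrum index="1" id="scan=2"/>'
    b'</run></mzML>'
)


def iter_attributes(source, local_name):
    """Yield the attributes of every element named local_name in the mzML namespace."""
    tag = "{%s}%s" % (MZML_NS, local_name)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy the attributes before clearing the element.
            yield dict(elem.attrib)
            # Release the element's children, text, and attributes; the emptied
            # element itself stays attached to its parent.
            elem.clear()


rows = list(iter_attributes(io.BytesIO(SAMPLE), "spectrum"))
print(rows)
```

In a real run the source would be a file path or open file, and each yielded dictionary could be turned into a database row in batches rather than collected in a list.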

lxml.html parsing with XPath and variables

一个人想着一个人 submitted on 2019-11-30 14:41:42
I have this HTML snippet:

    <div id="dw__toc">
    <h3 class="toggle">Table of Contents</h3>
    <div>
    <ul class="toc">
    <li class="level1"><div class="li"><a href="#section">#</a></div>
    <ul class="toc">
    <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
    <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
    <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>

Now I want to parse it with lxml.html. In the end I want a function to which I can pass a search term (e.g. "one") and which returns

    One #link1

For now I'm trying to get a variable
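lxml lets XPath variables such as `$term` be supplied as keyword arguments to xpath(), which avoids building the expression by string concatenation. The following sketch assumes a case-insensitive match on the link text is wanted, implemented with XPath 1.0 translate(); the function name `find_link` and the closing tags added to the snippet are illustrative.

```python
from lxml import html

# The snippet from the question, with closing tags added so it is complete.
TOC = """
<div id="dw__toc">
  <h3 class="toggle">Table of Contents</h3>
  <div>
    <ul class="toc">
      <li class="level1"><div class="li"><a href="#section">#</a></div>
        <ul class="toc">
          <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
          <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
          <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
        </ul>
      </li>
    </ul>
  </div>
</div>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def find_link(doc, searchterm):
    """Return (text, href) of the first TOC link whose text matches searchterm."""
    # $term, $upper and $lower are XPath variables filled in by keyword arguments.
    expr = ('//div[@id="dw__toc"]//a'
            '[translate(normalize-space(.), $upper, $lower) = $term]')
    matches = doc.xpath(expr, term=searchterm.lower(), upper=UPPER, lower=LOWER)
    if not matches:
        return None
    link = matches[0]
    return link.text, link.get("href")


doc = html.fromstring(TOC)
print(find_link(doc, "one"))
```

Because the search term is passed as a variable, quotes inside it cannot break the expression.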