lxml

Web-scraping practice project: fetch and download Douban's top-rated movies

Submitted by 喜你入骨 on 2019-12-04 01:27:49
Recap: in the previous post, "Common crawler libraries", we covered Python's four main scraping libraries: urllib, requests, BeautifulSoup, and selenium. We learned the common usage of urllib and requests, how to parse a page with BeautifulSoup, and how to drive a browser with selenium.

    # Import the web driver module
    from selenium import webdriver
    # Create a Chrome driver
    driver = webdriver.Chrome()
    # Open Baidu with get()
    driver.get("https://www.baidu.com")
    # Find the input box and type in what we want to search for
    input = driver.find_element_by_css_selector('#kw')
    input.send_keys("波多野结衣照片")
    # Find the search button and click it
    button = driver.find_element_by_css_selector('#su')
    button.click()

That is the code from last time for looking up pictures of 波多老师; the result is as shown. Scraping Douban movies and saving them locally: now let's scrape the Top 250 movies on Douban.

    import requests
    from bs4 import BeautifulSoup
    import
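As a sketch of the Douban step, this is how the Top 250 entries can be pulled out of the list markup with lxml. The inline sample and its class names (`grid_view`, `title`, `rating_num`) are assumptions modeled on Douban's page, not fetched live:

```python
from lxml import html

# Inline sample shaped like Douban's Top 250 list markup; the class
# names here are assumptions modeled on the real page.
sample = """
<ol class="grid_view">
  <li>
    <div class="hd"><a href="https://movie.douban.com/subject/1292052/">
      <span class="title">肖申克的救赎</span></a></div>
    <span class="rating_num">9.7</span>
  </li>
</ol>
"""

doc = html.fromstring(sample)
movies = []
for li in doc.xpath("//ol[@class='grid_view']/li"):
    title = li.xpath(".//span[@class='title']/text()")[0]
    rating = li.xpath(".//span[@class='rating_num']/text()")[0]
    movies.append((title, rating))

print(movies)
```

Against the real site the same XPath expressions would run over the fetched page instead of the inline sample.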

How to debug Python memory fault?

Submitted by 北战南征 on 2019-12-04 01:21:12
Edit: I'd really appreciate help in finding the bug, but since it might prove hard to find/reproduce, any general debugging help would be greatly appreciated too! Help me help myself! =)
Edit 2: Narrowing it down, commenting out code.
Edit 3: Seems lxml might not be the culprit, thanks! The full script is here. I need to go over it looking for references. What do they look like?
Edit 4: Actually, the script stops (CPU goes to 100%) in this, the parse_og part of it. So edit 3 is false; it must be lxml somehow.
Edit 5, MAJOR EDIT: As suggested by David Robinson and TankorSmash below, I've found a type of data
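For the general "how do I debug memory" part, the standard-library tracemalloc module can show which source lines hold the most memory; a minimal sketch, with a list comprehension standing in for the suspect parse_og code:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the suspect code path (e.g. the parse_og loop):
data = [bytes(10_000) for _ in range(1_000)]

snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics("lineno")

# The top entries point at the source lines holding the most memory.
for stat in stats[:3]:
    print(stat)
```

Running two snapshots and comparing them with `snapshot.compare_to()` narrows growth down to specific lines between the two points.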

How to get raw XML back from lxml?

Submitted by ぃ、小莉子 on 2019-12-04 01:09:30
Question: I'm using the following code to locate a div:

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(page), parser)
    div = tree.xpath("//div[@class='content']")[0]

My only problem is that after doing this I do not want to rely on lxml to extract the contents of said div: I just want to get back the raw XML the div contains. Is this doable, or do I have to abandon this method entirely?

Answer 1: I think you are looking for:

    etree.tostring(div)

Answer 2: Did you try tostring?

    raw_xml = etree.tostring
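Putting the question and both answers together, a runnable sketch (the sample page string is made up for illustration):

```python
from io import StringIO
from lxml import etree

page = "<html><body><div class='content'><b>raw</b> XML here</div></body></html>"

parser = etree.HTMLParser()
tree = etree.parse(StringIO(page), parser)
div = tree.xpath("//div[@class='content']")[0]

# tostring() serializes the element itself, children included;
# encoding="unicode" yields a str instead of bytes.
raw_xml = etree.tostring(div, encoding="unicode")
print(raw_xml)
```

No abandoning of the XPath approach is needed; `tostring()` works on any element the query returns.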

Receiving 'ImportError: cannot import name etree' when using lxml in Python on Mac

Submitted by 百般思念 on 2019-12-04 00:29:10
Question: I'm having difficulty properly installing lxml for Python on Mac. I have followed the instructions here, after which the installer indicates that the installation was successful (however, there are some warnings; the full log of the install and warnings can be found here). After running the install, I am trying to run Test.py in the lxml install directory to ensure that it's working correctly. I am immediately prompted with the error: ImportError: cannot import name etree. This error results
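One quick check when this import fails is to see which copy of lxml the interpreter is actually picking up; running Test.py from inside the source directory often imports the unbuilt source tree instead of the installed package. A minimal diagnostic sketch:

```python
import lxml
from lxml import etree

# If __file__ points into the unpacked source directory rather than
# site-packages, Python is importing the wrong copy of lxml.
print(lxml.__file__)
print(etree.LXML_VERSION)
```

Running the same two lines from a directory outside the lxml source tree distinguishes a broken install from a shadowed one.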

Remove class attribute from HTML using Python and lxml

Submitted by 夙愿已清 on 2019-12-03 23:17:11
Question: How do I remove class attributes from HTML using Python and lxml?

Example. I have:

    <p class="DumbClass">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>

I want:

    <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>

What I've tried so far: I've checked out lxml.html.clean.Cleaner; however, it does not have a method to strip out class attributes. You can set safe_attrs_only=True, but this does not remove the class attribute. Significant searching has turned up nothing workable. I think the fact that "class" is used in both HTML and Python further muddies search
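One working approach is to skip Cleaner entirely and delete the attribute directly; a sketch:

```python
from lxml import html

fragment = ('<p class="DumbClass">Lorem ipsum dolor sit amet, '
            'consectetur adipisicing elit</p>')
doc = html.fromstring(fragment)

# Delete the class attribute from every element that carries one.
for el in doc.xpath("//*[@class]"):
    del el.attrib["class"]

result = html.tostring(doc, encoding="unicode")
print(result)
```

The `//*[@class]` XPath visits every element with a class attribute, so the same loop also cleans nested markup.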

How to use python to get google news headlines and search keywords?

Submitted by 不想你离开。 on 2019-12-03 23:08:51
I am working on a project to look through Google News headlines and find keywords. I want it to:

- put the headlines into a text file
- remove commas, apostrophes, quotes, punctuation, etc.
- search for keywords

This is the code I have so far. I am getting the headlines; I now just need it to parse the keywords from each individual headline.

    from lxml import html
    import requests
    # Send request to get the web page
    response = requests.get('http://news.google.com')
    # Check if the request succeeded (response code 200)
    if response.status_code == 200:
        # Parse the html from the webpage
        pagehtml = html
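The missing punctuation-stripping and keyword step can be done with the standard library alone; a sketch, using a made-up headline and a tiny stop-word set (both are illustrative assumptions):

```python
import string

headlines = ["Stocks rally, again: tech's 'big' day!"]
stopwords = {"a", "an", "the", "again"}  # assumption: tiny illustrative list

keywords = []
for headline in headlines:
    # Strip commas, apostrophes, quotes, and other punctuation.
    cleaned = headline.translate(str.maketrans("", "", string.punctuation))
    keywords.extend(w.lower() for w in cleaned.split()
                    if w.lower() not in stopwords)

print(keywords)
```

The same loop can write each cleaned headline to a text file before extending the keyword list, covering the first goal as well.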

Error 'failed to load external entity' when using Python lxml

Submitted by 心已入冬 on 2019-12-03 22:20:33
I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error:

    failed to load external entity "<?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far:

    import urllib2
    import lxml.etree as etree
    file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
    data = file.read()
    file
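This error usually means the XML content string was handed to etree.parse(), which expects a filename, URL, or file object, so libxml2 tries to open the content itself as an external resource. Parsing the downloaded bytes with fromstring() avoids that; a sketch with inline bytes standing in for the downloaded file:

```python
import lxml.etree as etree

# Inline bytes standing in for the downloaded Green Button document.
data = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
<feed><entry>15-minute load profile</entry></feed>"""

# etree.parse(data) would treat the content as a filename/URL;
# fromstring() parses the bytes directly.
root = etree.fromstring(data)
print(root.tag)
```

Note that `fromstring()` must be given bytes here, since the document carries an encoding declaration.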

XHTML namespace issues with cssselect in lxml

Submitted by 落爺英雄遲暮 on 2019-12-03 22:17:03
I have problems using cssselect with XHTML (or XML with a namespace). Although the documentation says how to use namespaces in cssselect, I do not understand it: cssselect namespaces.

My input XHTML string:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>Teststylesheet</title>
    <style type="text/css">
    /*<![CDATA[*/
    ol{margin:0;padding:0}
    /*]]>*/
    </style>
    </head>
    <body>
    </body>
    </html>

My Python script:

    parser = etree.XMLParser()
    tree = etree.fromstring(xhtmlstring, parser)

Get data between two tags in Python

Submitted by 允我心安 on 2019-12-03 21:49:32
    <h3>
    <a href="article.jsp?tp=&arnumber=16">
    Granular computing based
    <span class="snippet">data</span>
    <span class="snippet">mining</span>
    in the views of rough set and fuzzy set
    </a>
    </h3>

Using Python I want to get the value from the anchor tag, which should be:

    Granular computing based data mining in the views of rough set and fuzzy set

I tried using lxml:

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
    rawResponse = tree.xpath(xpath1)
    print rawResponse

and I am getting the following output:

    ['\r\n\t\t','\r
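Instead of collecting the text nodes one by one, text_content() concatenates all descendant text, and a split/join pass collapses the \r\n\t whitespace the markup introduces; a sketch:

```python
from lxml import html

snippet = """<h3>
  <a href="article.jsp?tp=&amp;arnumber=16">
    Granular computing based <span class="snippet">data</span>
    <span class="snippet">mining</span> in the views of rough set and fuzzy set
  </a>
</h3>"""

doc = html.fromstring(snippet)
anchor = doc.xpath("//h3/a")[0]

# text_content() gathers all descendant text; split/join normalizes
# the surrounding whitespace into single spaces.
text = " ".join(anchor.text_content().split())
print(text)
```

The XPath equivalent is `normalize-space(string(//h3/a))`, which produces the same flattened string.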

Error parsing a DTD using lxml

Submitted by て烟熏妆下的殇ゞ on 2019-12-03 21:11:21
I'm trying to write a validation script that will validate XML against the NITF DTD, http://www.iptc.org/std/NITF/3.4/specification/dtd/nitf-3-4.dtd. Based on this post I came up with the following simple script to validate a NITF XML document. Below is the error message I get when the script is run, which isn't very descriptive and makes it hard to debug. Any help is appreciated.

    #!/usr/bin/env python
    def main():
        from lxml import etree, objectify
        from StringIO import StringIO
        f = open('nitf_test.xml')
        xml_doc = f.read()
        f.close()
        f = open('nitf-3-4.dtd')
        dtd_doc = f.read()
        f.close()
        dtd =
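The lxml side of such a script looks like the sketch below; a toy one-element DTD and document stand in for the NITF files (assumption: the real script reads nitf-3-4.dtd and nitf_test.xml from disk as above):

```python
from io import StringIO
from lxml import etree

# Toy stand-ins for nitf-3-4.dtd and nitf_test.xml.
dtd_doc = "<!ELEMENT note (#PCDATA)>"
xml_doc = "<note>hello</note>"

dtd = etree.DTD(StringIO(dtd_doc))
root = etree.fromstring(xml_doc)

valid = dtd.validate(root)
print(valid)

# When validation fails, the error log carries the descriptive
# messages the bare return value lacks.
if not valid:
    print(dtd.error_log.filter_from_errors())
```

Printing `dtd.error_log` on failure is usually what turns an opaque "invalid" result into a debuggable one.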