scrapy

Scrapy Installation

▌冷眼眸甩不掉的悲伤 submitted on 2019-12-21 14:03:45
Reference: http://www.open-open.com/lib/view/open1420624463656.html

There are many ways to implement site scraping, but when you need to crawl and download a large amount of content, the Scrapy framework is an excellent tool. The installation steps are outlined below. Note: make sure to download packages that match your Python version, otherwise the installer will complain that it cannot find Python.

1. Install Python. After installing, remember to configure the environment: add the Python directory and its Scripts subdirectory to the system Path variable (installers from Python 2.7 onwards offer an "add Python to PATH" option; just tick it). Type python in cmd; if the version information appears, the configuration is complete. Python download: https://www.python.org/downloads/

2. Install setuptools or pip. On Ubuntu Linux: sudo apt-get install python-pip. On Windows: download pip-6.1.1.tar.gz (md5, pgp) from https://pypi.python.org/pypi/pip, unpack it, and run python setup.py install inside the extracted folder; or download an exe installer directly from http://www.lfd.uci.edu/~gohlke
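
Once pip is working, a one-line check from the interpreter confirms the whole chain; this is a minimal sketch, assuming Scrapy itself was then installed via pip install Scrapy:

    import scrapy
    print(scrapy.__version__)  # prints the installed version if everything worked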

Scrapy with a nested array

强颜欢笑 submitted on 2019-12-21 13:57:30
Question: I'm new to Scrapy and would like to understand how to scrape an object into nested JSON output. Right now I'm producing JSON that looks like:

    [{'a': 1, 'b': 2, 'c': 3}]

and I'd like it more like this:

    [{'a': 1, '_junk': {'b': 2, 'c': 3}}]

where I put some fields into a _junk sub-object to post-process later. The current code under the parse definition in my scrapername.py is:

    item['a'] = x
    item['b'] = y
    item['c'] = z

And it seemed like:

    item['a'] = x
    item['_junk'][
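
One way to get that shape, as a minimal sketch assuming x, y, and z hold the extracted values and the item field accepts arbitrary values (as Scrapy's Field does), is to assign _junk as a dictionary in one step instead of indexing into a field that does not exist yet:

    item['a'] = x
    item['_junk'] = {'b': y, 'c': z}  # a nested dict serialises as a JSON sub-object

Writing item['_junk']['b'] before _junk itself exists raises a KeyError, which is what the truncated snippet looks poised to do.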

Scrapy with selenium, webdriver failing to instantiate

旧城冷巷雨未停 submitted on 2019-12-21 12:27:47
Question: I am trying to use Selenium/PhantomJS with Scrapy and I'm riddled with errors. For example, take the following code snippet:

    def parse(self, response):
        while True:
            try:
                driver = webdriver.PhantomJS()
                # do some stuff
                driver.quit()
                break
            except (WebDriverException, TimeoutException):
                try:
                    driver.quit()
                except UnboundLocalError:
                    print "Driver failed to instantiate"
                time.sleep(3)
                continue

A lot of the time it seems the driver has failed to instantiate (so the driver name is unbound, hence the
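
The UnboundLocalError fires because driver is only bound once webdriver.PhantomJS() returns, so the cleanup path tries to quit a driver that never existed. A sketch of one way to keep the retry behaviour while guaranteeing the name is always bound (an assumed restructuring, not the poster's final code):

    import time
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException, TimeoutException

    driver = None
    while driver is None:
        try:
            driver = webdriver.PhantomJS()
        except (WebDriverException, TimeoutException):
            # the constructor failed, so there is nothing to quit yet
            print("Driver failed to instantiate")
            time.sleep(3)
    # ... do some stuff ...
    driver.quit()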

scrapy get the entire text including children

大城市里の小女人 submitted on 2019-12-21 12:10:15
Question: I have a series of <p> elements inside a document I'm scraping with Scrapy. Some of them are:

    <p><span>bla bla bla</span></p>

or:

    <p><span><span>bla bla bla</span><span>second bla bla</span></span></p>

I want to extract all the text including the children, assuming I already have the selector for the <p> (in the second example, I want the string "bla bla bla second bla bla").

Answer 1: You can just use //text() to extract all text from children nodes, for example:

    .//p//text()

Source: https://stackoverflow.com/questions
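
A runnable sketch of the answer, feeding the question's second example into Scrapy's Selector (the space-join is an assumption about the desired output format):

    from scrapy.selector import Selector

    html = '<p><span><span>bla bla bla</span><span>second bla bla</span></span></p>'
    sel = Selector(text=html)
    # //p//text() descends through all children and collects every text leaf
    print(' '.join(sel.xpath('//p//text()').extract()))
    # -> bla bla bla second bla bla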

Scrapy Crawler in python cannot follow links?

家住魔仙堡 submitted on 2019-12-21 12:08:41
Question: I wrote a crawler in Python using Scrapy. The following is the Python code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    #from scrapy.item import Item
    from a11ypi.items import AYpiItem

    class AYpiSpider(CrawlSpider):
        name = "AYpi"
        allowed_domains = ["a11y.in"]
        start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]
        rules = (Rule(SgmlLinkExtractor(allow =
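
The snippet is cut off at the rules, but one documented CrawlSpider constraint is worth checking whenever links are not followed: CrawlSpider implements parse itself to apply the rules, so a spider that overrides parse, or names it as a rule callback, silently stops following links. A sketch of a rule set that avoids the clash (the allow pattern and callback name here are illustrative, not taken from the question):

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/a11ypi/idea/',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # populate and yield an AYpiItem from each followed page here
        pass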

Combining base url with resultant href in scrapy

爱⌒轻易说出口 submitted on 2019-12-21 09:07:28
Question: Below is my spider code:

    class Blurb2Spider(BaseSpider):
        name = "blurb2"
        allowed_domains = ["www.domain.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector
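
The join in parse is the likely culprit: urlparse.urljoin('www.domain.com/', ...) produces a scheme-less URL that Scrapy cannot request. A minimal sketch of the usual fix, joining against the URL the links came from (same Python 2 era imports as the question):

    import urlparse
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):  # inside Blurb2Spider
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            # response.url carries the scheme and host, so relative hrefs resolve
            yield Request(urlparse.urljoin(response.url, i), callback=self.parse_url)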

Scrapy Framework Installation and Configuration Summary

北慕城南 submitted on 2019-12-21 07:20:35
Windows platform: the system is Win7, Python version 2.7.7. Official documentation: http://doc.scrapy.org/en/latest/intro/install.html

1. Install Python. Install Python 2.7.7, then configure the environment variables. For example, mine is installed at D:\python2.7.7, so the following two paths go into the Path variable:

    D:\python2.7.7;D:\python2.7.7\Scripts

Once that is set, run python --version on the command line; if no error is reported, the installation succeeded.

2. Install pywin32. On Windows, pywin32 must be installed. At http://sourceforge.net/projects/pywin32/files/ pick the build that matches your installed Python version, then double-click the download and click through the installer to finish. To verify: at the Python prompt, run import win32com; if no error is reported, the installation succeeded.

3. Install pip. pip is the tool used to install the other required packages. First download get-pip.py, then run the following command from the directory containing that file:

    python get-pip.py

This installs pip, and at the same time it installs
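
A quick interpreter-level sanity check tying the three steps together (a sketch; each line only succeeds if the corresponding step worked):

    import sys
    print(sys.version)      # step 1: should report the 2.7.7 interpreter
    import win32com         # step 2: fails unless pywin32 installed correctly
    import pip
    print(pip.__version__)  # step 3: confirms get-pip.py completed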

scrapy: scrape all pages that have this syntax

旧街凉风 submitted on 2019-12-21 06:57:39
Question: I want Scrapy to scrape all pages that have this syntax: mywebsite/?page=INTEGER. I tried this:

    start_urls = ['MyWebsite']
    rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]

but it seems that the link stays MyWebsite. So what should I do to make it understand that I want to add /?page=NumberOfPage?

Edit: I mean that I want to scrape these pages:

    mywebsite/?page=1
    mywebsite/?page=2
    mywebsite/?page=3
    mywebsite/?page=4
    mywebsite/?page=5
    ..
    ..
    ..
    mywebsite/?page=7677654
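
Two things stand out here, both hedged guesses from the snippet alone. First, CrawlSpider's documentation warns against using parse as a rule callback, because CrawlSpider needs parse internally to apply the rules, and the rule above does exactly that. A sketch with a differently named callback (parse_page is illustrative):

    rules = [Rule(SgmlLinkExtractor(allow=[r'\?page=\d+']),
                  callback='parse_page', follow=True)]

    def parse_page(self, response):
        # each mywebsite/?page=N response lands here
        pass

Second, a link extractor can only discover pages that are actually linked from the crawled pages; if they are not, generating the URLs up front is the alternative (the page count here is an assumption):

    start_urls = ['http://mywebsite/?page=%d' % n for n in range(1, 101)]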

webpage returns 405 status code error when accessed with scrapy

让人想犯罪 __ submitted on 2019-12-21 06:29:29
Question: I am trying to scrape the URL below with Scrapy: https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n but it always ends up giving a status 405 error. I have searched this topic, and the answers always say it occurs when the request method is incorrect, like POST in place of GET, but that is surely not the case here. Here is the code for my spider:

    import scrapy

    class sampleSpider(scrapy.Spider):
        AUTOTHROTTLE_ENABLED = True
        name
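
When a plain GET draws a 405, the site is often rejecting the client rather than the method, and Scrapy's default User-Agent is a common trigger. A sketch of the usual first countermeasure (the UA string is illustrative, and this is not a verified fix for this particular site); note also that Scrapy reads per-spider settings from custom_settings, not from bare class attributes like the AUTOTHROTTLE_ENABLED above:

    import scrapy

    class SampleSpider(scrapy.Spider):
        name = 'sample'
        custom_settings = {
            'AUTOTHROTTLE_ENABLED': True,
            # present a browser-like identity instead of Scrapy's default UA
            'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/79.0 Safari/537.36'),
        }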
