scrapy

Scrapy Installation

▌冷眼眸甩不掉的悲伤 submitted on 2019-12-21 14:03:45
Reference: http://www.open-open.com/lib/view/open1420624463656.html

There are many ways to implement site scraping, but when you need to crawl and download a large amount of content, the Scrapy framework is an excellent tool. The installation steps are outlined below. Note: make sure to download packages that match your Python version, otherwise the installer will complain that it cannot find Python.

1. Install Python. After installing, remember to configure the environment: add the Python directory and its Scripts subdirectory to the system Path variable (installers from Python 2.7 onwards offer an "add Python to PATH" option; just tick it). Type python in cmd; if the version information appears, the configuration is complete. Python download: https://www.python.org/downloads/

2. Install setuptools or pip. On Ubuntu Linux: sudo apt-get install python-pip. On Windows: download pip-6.1.1.tar.gz (md5, pgp) from https://pypi.python.org/pypi/pip, unpack it, and run python setup.py install inside the extracted folder; or download an exe installer directly from http://www.lfd.uci.edu/~gohlke
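
Once pip is working, a one-line check from the interpreter confirms the whole chain; this is a minimal sketch, assuming Scrapy itself was then installed via pip install Scrapy:

    import scrapy
    print(scrapy.__version__)  # prints the installed version if everything worked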

Scrapy with a nested array

强颜欢笑 submitted on 2019-12-21 13:57:30
Question: I'm new to Scrapy and would like to understand how to scrape an object into nested JSON output. Right now I'm producing JSON that looks like:

    [{'a': 1, 'b': 2, 'c': 3}]

and I'd like it more like this:

    [{'a': 1, '_junk': {'b': 2, 'c': 3}}]

where I put some fields into a _junk sub-object to post-process later. The current code under the parse definition in my scrapername.py is:

    item['a'] = x
    item['b'] = y
    item['c'] = z

And it seemed like:

    item['a'] = x
    item['_junk'][
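
One way to get that shape, as a minimal sketch assuming x, y, and z hold the extracted values and the item field accepts arbitrary values (as Scrapy's Field does), is to assign _junk as a dictionary in one step instead of indexing into a field that does not exist yet:

    item['a'] = x
    item['_junk'] = {'b': y, 'c': z}  # a nested dict serialises as a JSON sub-object

Writing item['_junk']['b'] before _junk itself exists raises a KeyError, which is what the truncated snippet looks poised to do.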

Scrapy with selenium, webdriver failing to instantiate

旧城冷巷雨未停 submitted on 2019-12-21 12:27:47
Question: I am trying to use Selenium/PhantomJS with Scrapy and I'm riddled with errors. For example, take the following code snippet:

    def parse(self, response):
        while True:
            try:
                driver = webdriver.PhantomJS()
                # do some stuff
                driver.quit()
                break
            except (WebDriverException, TimeoutException):
                try:
                    driver.quit()
                except UnboundLocalError:
                    print "Driver failed to instantiate"
                time.sleep(3)
                continue

A lot of the time it seems the driver has failed to instantiate (so the driver name is unbound, hence the
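
The UnboundLocalError fires because driver is only bound once webdriver.PhantomJS() returns, so the cleanup path tries to quit a driver that never existed. A sketch of one way to keep the retry behaviour while guaranteeing the name is always bound (an assumed restructuring, not the poster's final code):

    import time
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException, TimeoutException

    driver = None
    while driver is None:
        try:
            driver = webdriver.PhantomJS()
        except (WebDriverException, TimeoutException):
            # the constructor failed, so there is nothing to quit yet
            print("Driver failed to instantiate")
            time.sleep(3)
    # ... do some stuff ...
    driver.quit()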

scrapy get the entire text including children

大城市里の小女人 submitted on 2019-12-21 12:10:15
Question: I have a series of <p> elements inside a document I'm scraping with Scrapy. Some of them are:

    <p><span>bla bla bla</span></p>

or:

    <p><span><span>bla bla bla</span><span>second bla bla</span></span></p>

I want to extract all the text including the children, assuming I already have the selector for the <p> (in the second example, I want the string "bla bla bla second bla bla").

Answer 1: You can just use //text() to extract all text from children nodes, for example:

    .//p//text()

Source: https://stackoverflow.com/questions
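
A runnable sketch of the answer, feeding the question's second example into Scrapy's Selector (the space-join is an assumption about the desired output format):

    from scrapy.selector import Selector

    html = '<p><span><span>bla bla bla</span><span>second bla bla</span></span></p>'
    sel = Selector(text=html)
    # //p//text() descends through all children and collects every text leaf
    print(' '.join(sel.xpath('//p//text()').extract()))
    # -> bla bla bla second bla bla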

Scrapy Crawler in python cannot follow links?

家住魔仙堡 submitted on 2019-12-21 12:08:41
Question: I wrote a crawler in Python using Scrapy. The following is the Python code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    #from scrapy.item import Item
    from a11ypi.items import AYpiItem

    class AYpiSpider(CrawlSpider):
        name = "AYpi"
        allowed_domains = ["a11y.in"]
        start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]
        rules = (Rule(SgmlLinkExtractor(allow =
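
The snippet is cut off at the rules, but one documented CrawlSpider constraint is worth checking whenever links are not followed: CrawlSpider implements parse itself to apply the rules, so a spider that overrides parse, or names it as a rule callback, silently stops following links. A sketch of a rule set that avoids the clash (the allow pattern and callback name here are illustrative, not taken from the question):

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/a11ypi/idea/',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # populate and yield an AYpiItem from each followed page here
        pass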

Combining base url with resultant href in scrapy

爱⌒轻易说出口 submitted on 2019-12-21 09:07:28
Question: Below is my spider code:

    class Blurb2Spider(BaseSpider):
        name = "blurb2"
        allowed_domains = ["www.domain.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector
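
The join in parse is the likely culprit: urlparse.urljoin('www.domain.com/', ...) produces a scheme-less URL that Scrapy cannot request. A minimal sketch of the usual fix, joining against the URL the links came from (same Python 2 era imports as the question):

    import urlparse
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):  # inside Blurb2Spider
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            # response.url carries the scheme and host, so relative hrefs resolve
            yield Request(urlparse.urljoin(response.url, i), callback=self.parse_url)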

Scrapy Framework Installation and Configuration Summary

北慕城南 submitted on 2019-12-21 07:20:35
Windows platform: the system is Win7, Python version 2.7.7. Official documentation: http://doc.scrapy.org/en/latest/intro/install.html

1. Install Python. Install Python 2.7.7, then configure the environment variables. For example, mine is installed at D:\python2.7.7, so the following two paths go into the Path variable:

    D:\python2.7.7;D:\python2.7.7\Scripts

Once that is set, run python --version on the command line; if no error is reported, the installation succeeded.

2. Install pywin32. On Windows, pywin32 must be installed. At http://sourceforge.net/projects/pywin32/files/ pick the build that matches your installed Python version, then double-click the download and click through the installer to finish. To verify: at the Python prompt, run import win32com; if no error is reported, the installation succeeded.

3. Install pip. pip is the tool used to install the other required packages. First download get-pip.py, then run the following command from the directory containing that file:

    python get-pip.py

This installs pip, and at the same time it installs
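
A quick interpreter-level sanity check tying the three steps together (a sketch; each line only succeeds if the corresponding step worked):

    import sys
    print(sys.version)      # step 1: should report the 2.7.7 interpreter
    import win32com         # step 2: fails unless pywin32 installed correctly
    import pip
    print(pip.__version__)  # step 3: confirms get-pip.py completed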

scrapy: scrape all pages that have this syntax

旧街凉风 submitted on 2019-12-21 06:57:39
Question: I want Scrapy to scrape all pages that have this syntax: mywebsite/?page=INTEGER. I tried this:

    start_urls = ['MyWebsite']
    rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]

but it seems that the link stays MyWebsite. So what should I do to make it understand that I want to add /?page=NumberOfPage?

Edit: I mean that I want to scrape these pages:

    mywebsite/?page=1
    mywebsite/?page=2
    mywebsite/?page=3
    mywebsite/?page=4
    mywebsite/?page=5
    ..
    ..
    ..
    mywebsite/?page=7677654
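
Two things stand out here, both hedged guesses from the snippet alone. First, CrawlSpider's documentation warns against using parse as a rule callback, because CrawlSpider needs parse internally to apply the rules, and the rule above does exactly that. A sketch with a differently named callback (parse_page is illustrative):

    rules = [Rule(SgmlLinkExtractor(allow=[r'\?page=\d+']),
                  callback='parse_page', follow=True)]

    def parse_page(self, response):
        # each mywebsite/?page=N response lands here
        pass

Second, a link extractor can only discover pages that are actually linked from the crawled pages; if they are not, generating the URLs up front is the alternative (the page count here is an assumption):

    start_urls = ['http://mywebsite/?page=%d' % n for n in range(1, 101)]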

webpage returns 405 status code error when accessed with scrapy

让人想犯罪 __ submitted on 2019-12-21 06:29:29
Question: I am trying to scrape the URL below with Scrapy: https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n but it always ends up giving a status 405 error. I have searched this topic, and the answers always say it occurs when the request method is incorrect, like POST in place of GET, but that is surely not the case here. Here is the code for my spider:

    import scrapy

    class sampleSpider(scrapy.Spider):
        AUTOTHROTTLE_ENABLED = True
        name
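
When a plain GET draws a 405, the site is often rejecting the client rather than the method, and Scrapy's default User-Agent is a common trigger. A sketch of the usual first countermeasure (the UA string is illustrative, and this is not a verified fix for this particular site); note also that Scrapy reads per-spider settings from custom_settings, not from bare class attributes like the AUTOTHROTTLE_ENABLED above:

    import scrapy

    class SampleSpider(scrapy.Spider):
        name = 'sample'
        custom_settings = {
            'AUTOTHROTTLE_ENABLED': True,
            # present a browser-like identity instead of Scrapy's default UA
            'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/79.0 Safari/537.36'),
        }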
