scrapy

Scrapy deploy stopped working

冷暖自知 Submitted on 2020-01-15 15:55:10
Question: I am trying to deploy a Scrapy project using scrapyd, but it gives me an error ... sudo scrapy deploy default -p eScraper Building egg of eScraper-1371463750 'build/scripts-2.7' does not exist -- can't clean it zip_safe flag not set; analyzing archive contents... eScraperInterface.settings: module references __file__ eScraper.settings: module references __file__ Deploying eScraper-1371463750 to http://localhost:6800/addversion.json Server response (200): Traceback (most recent call last): File

DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname

自闭症网瘾萝莉.ら Submitted on 2020-01-15 12:24:11
Question: This question is an extension of the resolved question here, i.e. Crawling LinkedIn while authenticated with Scrapy (@Gates). I keep the base of that script the same, only adding my own session_key and session_password, and changing the start URL for my use case, as below. class LinkedPySpider(InitSpider): name = 'Linkedin' allowed_domains = ['linkedin.com'] login_page = 'https://www.linkedin.com/uas/login' start_urls=[
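The 'your.proxy.com' host in the error looks like a proxy placeholder that is being resolved at request time (for example from the http_proxy environment variable or a request.meta['proxy'] entry); that diagnosis is an inference, not something stated in the question. A minimal sketch of how a per-request proxy is normally configured in Scrapy, with the placeholder replaced by a resolvable address (the spider name and proxy address below are illustrative):

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate per-request proxy configuration.
    name = 'proxy_check'
    start_urls = ['https://www.linkedin.com/uas/login']

    def start_requests(self):
        for url in self.start_urls:
            # 'proxy' must point at a resolvable host:port; a literal placeholder such as
            # 'http://your.proxy.com:8080' fails DNS lookup before the request is ever sent.
            yield scrapy.Request(url, meta={'proxy': 'http://127.0.0.1:3128'},
                                 callback=self.parse)

    def parse(self, response):
        self.logger.info('Fetched %s (%s)', response.url, response.status)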

crawl a list of sites one by one with scrapy

∥☆過路亽.° Submitted on 2020-01-15 10:33:51
Question: I am trying to crawl a list of sites with Scrapy. I tried putting the list of website URLs in start_urls, but then found that it used more memory than I could afford. Is there any way to have Scrapy crawl only one or two sites at a time? Answer 1: You can try setting CONCURRENT_REQUESTS = 1 so that you aren't overloaded with data: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests Answer 2: You can define a start_requests method which iterates through requests to your URLs, as sketched below. This
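A minimal sketch combining both answers, assuming the site list lives in a local sites.txt file (the filename and spider name are illustrative, not taken from the question):

import scrapy

class SiteListSpider(scrapy.Spider):
    name = 'site_list'
    # Limit Scrapy to one in-flight request at a time, as suggested in answer 1.
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def start_requests(self):
        # Reading the URLs lazily keeps memory flat even for a very long list.
        with open('sites.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}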

Scrapy merge subsite-item with site-item

家住魔仙堡 Submitted on 2020-01-15 09:35:08
Question: I'm trying to scrape details from a subsite and merge them with the details scraped from the main site. I've researched on Stack Overflow as well as in the documentation, but I still can't get my code to work. It seems that my function for extracting additional details from the subsite does not work. If anyone could take a look I would be very grateful. # -*- coding: utf-8 -*- from scrapy.spiders import Spider from scrapy.selector import Selector from scrapeInfo.items import infoItem import pyodbc
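The usual pattern here is to build the item in the listing callback, pass it along to the subsite request, and yield it only after the detail callback has filled in the remaining fields. A hedged sketch of that pattern (the selectors, field names and URLs are assumptions, since the question's code is cut off):

import scrapy

class InfoSpider(scrapy.Spider):
    name = 'info'
    start_urls = ['https://example.com/list']  # placeholder listing page

    def parse(self, response):
        for row in response.css('div.listing'):
            item = {
                'title': row.css('a::text').get(),
                'url': response.urljoin(row.css('a::attr(href)').get()),
            }
            # Pass the half-filled item to the detail callback instead of yielding it here.
            yield scrapy.Request(item['url'], callback=self.parse_detail,
                                 cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # Merge the subsite fields into the item scraped from the main site.
        item['description'] = response.css('div.description::text').get()
        yield item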

unable to deploy scrapy project

白昼怎懂夜的黑 Submitted on 2020-01-15 08:05:44
Question: Suddenly my Scrapy deployment started failing: sudo scrapy deploy default -p eScraper Password: Building egg of eScraper-1372327569 'build/scripts-2.7' does not exist -- can't clean it zip_safe flag not set; analyzing archive contents... eScraper.settings: module references __file__ eScraperInterface.settings: module references __file__ Deploying eScraper-1372327569 to http://localhost:6800/addversion.json Server response (200): {"status": "error", "message": "OSError: [Errno 20]

Advanced Scrapy《封号码罗》: how to elegantly conquer the world's richest man's Amazon (amazon.com)

最后都变了- Submitted on 2020-01-15 06:03:04
Disclaimer: this is an original article intended only for learning; please do not casually use it for commercial purposes, and the author reserves all legal rights to this article. Comments, likes, favorites and shares are welcome! I have written six or seven Amazon crawlers for different use cases. Today I am sharing one of them, which is somewhat harder than the others; it uses a crawling technique I had never used before, and although it gave me a headache for a day and a half, I finally got it working! First, the crawled results, to admire the outcome! # In settings.py I only modified the request headers; there are no other parameters worth setting DEFAULT_REQUEST_HEADERS = { 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3', 'accept-encoding' Source: CSDN Author: Python 键盘上的舞者 Link: https://blog.csdn.net/Python_DJ/article/details/103830378
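The snippet above breaks off after the second header key. For orientation only, a browser-like DEFAULT_REQUEST_HEADERS block in settings.py would look roughly like this; every key and value after 'accept' is an assumption, not copied from the original article:

# settings.py -- illustrative browser-like default headers; values beyond
# 'accept' are assumptions, not taken from the original post.
DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
}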

Python notes: a hands-on Scrapy crawler case study for scraping data, with project code

 ̄綄美尐妖づ Submitted on 2020-01-14 19:57:14
Overview: this project uses a hands-on Scrapy crawler case to consolidate the technology stack, for later review and reflection. Task: crawl all social-recruitment postings on careers.tencent.com that match a given search (jobs matching the keywords China and AI) and store the information in a MySQL database. Address: https://careers.tencent.com Steps: first crawl the job listing on each page, then crawl the corresponding job detail pages. Analysis: Plan 1 -- after inspecting the page, the id needed for a detail page can be found in the share div of the listing page (when a page has no concrete link address, the navigation is most likely handled by a JS script, or the target address comes from API data). Viewing the page source shows very little code: it has been built by a front-end bundle and all data is loaded afterwards, so this plan is not feasible. (Plan 1 figure) Plan 2 -- parse and process the AJAX request data, as sketched below. List API example: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1578972041752&countryId=1&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn Simplified list API request parameters: https://careers
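A minimal sketch of plan 2, driving the Query list API directly from a Scrapy spider; only the Query URL comes from the post, while the JSON field names ('Data', 'Posts', 'RecruitPostName', 'PostId') and the pagination logic are assumptions about the API's response layout:

import json
import scrapy

class TencentCareersSpider(scrapy.Spider):
    name = 'tencent_careers'
    # List API taken from the post; pageIndex is incremented to walk through pages.
    list_api = ('https://careers.tencent.com/tencentcareer/api/post/Query'
                '?countryId=1&keyword=AI&pageIndex={page}&pageSize=10&language=zh-cn&area=cn')

    def start_requests(self):
        yield scrapy.Request(self.list_api.format(page=1),
                             callback=self.parse_list, cb_kwargs={'page': 1})

    def parse_list(self, response, page):
        data = json.loads(response.text)
        # 'Data' and 'Posts' are assumed keys in the API's JSON response.
        posts = (data.get('Data') or {}).get('Posts') or []
        for post in posts:
            yield {'title': post.get('RecruitPostName'), 'id': post.get('PostId')}
        # Keep paging until an empty list comes back.
        if posts:
            yield scrapy.Request(self.list_api.format(page=page + 1),
                                 callback=self.parse_list, cb_kwargs={'page': page + 1})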

A Scrapy-based search engine (part 3): crawling a Q&A site

梦想与她 Submitted on 2020-01-14 17:54:58
Getting started: first use selenium to obtain the cookies of a logged-in Zhihu session, then send a request to the Zhihu home page carrying those cookies, crawl every Q&A page under the home-page hot list (start_urls can be changed), and asynchronously insert the fields into MySQL. Crawling the Zhihu hot list: obtain the post-login cookies through selenium and save them in JSON format to zhihuCookies.json:

def loginZhihu(self):
    loginurl = 'https://www.zhihu.com/signin'
    # Load the webdriver, used to reach the login page
    driver = webdriver.Chrome()
    driver.get(loginurl)
    # Sleep for 10s before the QR code is scanned
    time.sleep(10)
    input("Scan the QR code on the page, confirm the login on your phone, then return to the editor and press Enter: ")
    # Get the cookies of the logged-in session
    cookies = driver.get_cookies()
    driver.close()
    # Save the cookies; later requests can read them from the file, which avoids logging in every time.
    # They could also be returned so that the login method runs first on each execution.
    # Save as a local json file
    jsonCookies = json.dumps
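The snippet ends just as the cookies are serialized. A hedged sketch of the other half of the workflow, loading zhihuCookies.json in a Scrapy spider and attaching the cookies to the first request (the spider name, hot-list URL and selectors are illustrative):

import json
import scrapy

class ZhihuHotSpider(scrapy.Spider):
    name = 'zhihu_hot'
    start_urls = ['https://www.zhihu.com/hot']

    def start_requests(self):
        # Selenium saves cookies as a list of dicts; Scrapy wants name -> value pairs.
        with open('zhihuCookies.json') as f:
            selenium_cookies = json.load(f)
        cookies = {c['name']: c['value'] for c in selenium_cookies}
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # Follow each question linked from the hot list (the selector is an assumption).
        for href in response.css('a[href*="/question/"]::attr(href)').getall():
            yield response.follow(href, callback=self.parse_question)

    def parse_question(self, response):
        yield {'url': response.url, 'title': response.css('h1::text').get()}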