Scrapyd



Installation

Install scrapyd on the central node and scrapyd-client on the child (worker) nodes:
pip3 install scrapyd
pip3 install scrapyd-client

scrapyd-client serves two purposes:

Package the local project code into an egg file
Upload the egg to the remote server (a sketch of this upload call follows below)
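For reference, the upload step is just an HTTP POST to scrapyd's addversion.json endpoint. A minimal Python sketch (not the author's code; the egg path, project name and version are placeholders):

import requests

# Upload a pre-built egg to scrapyd's addversion.json endpoint,
# which is what scrapyd-deploy does after packaging the project.
with open('douban.egg', 'rb') as egg:
    resp = requests.post(
        'http://localhost:6800/addversion.json',
        data={'project': 'douban', 'version': '1_0'},
        files={'egg': egg},
    )
print(resp.json())  # {'status': 'ok', ...} on success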

Configuring scrapyd-deploy on Windows

Create scrapyd-deploy.bat under H:\Python36\Scripts with the following content:

python H:/Python36/Scripts/scrapyd-deploy %*

Place curl.exe in H:\Python36\Scripts so the curl commands below work from cmd.


Start the scrapyd service (run the scrapyd command) before deploying.

Checking scrapyd-deploy


Open a cmd prompt in the Scrapy project directory and run scrapyd-deploy to verify it works:
H:\DDD-scrapy\douban>scrapyd-deploy

List the deploy targets configured in scrapy.cfg:

scrapyd-deploy -l

 


Edit scrapy.cfg in the Scrapy project:
[deploy:dj]        # deploy target (development) name
url = http://localhost:6800/
project = douban   # project name

Run scrapy list to confirm the project is usable; the output (here, dang) is the spider name.

If it is not available, add the required entries to the project settings.

Deploying a spider with scrapyd-deploy

scrapyd-deploy dj -p douban

H:/Python36/Scripts/scrapyd-deploy dang1 -p dangdang 

Here dang1 is the deploy target name and dangdang is the project name.

 

 

The scheduled jobs can be viewed in scrapyd's web interface at http://127.0.0.1:6800/jobs

 

Check the status of the scrapyd daemon:
curl http://localhost:6800/daemonstatus.json

Starting a spider with curl

curl http://localhost:6800/schedule.json -d project=dangdang -d spider=dang

dangdang is the project name; dang is the spider name.

 


Another example, with project=douban and spider=doubanlogin:
curl http://localhost:6800/schedule.json -d project=douban -d spider=doubanlogin
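The same call can be made from Python instead of curl, which is convenient when scheduling runs programmatically. A minimal sketch using requests (host, project and spider names follow the example above):

import requests

# POST to schedule.json exactly as the curl command above does.
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'douban', 'spider': 'doubanlogin'},
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}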

 

 

Remote deployment: scrapyd-deploy

scrapyd-deploy 17k -p my17k

 

List all deployed projects:

curl http://47.97.169.234:6800/listprojects.json

List the versions of a given project:

curl http://47.97.169.234:6800/listversions.json\?project\=my17k

Start a spider remotely:

curl http://47.97.169.234:6800/schedule.json -d project=my17k -d spider=mys1

 

Running the command three times starts three jobs, similar to starting three separate processes:

curl http://47.97.169.234:6800/schedule.json -d project=my17k -d spider=mys1

Check the running status of the current project:
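This can be queried through scrapyd's listjobs.json endpoint. A minimal Python sketch (host and project name follow the examples above):

import requests

# listjobs.json returns the pending, running and finished jobs of a project.
resp = requests.get(
    'http://47.97.169.234:6800/listjobs.json',
    params={'project': 'my17k'},
)
jobs = resp.json()
print(jobs['running'])   # jobs currently running (each entry includes its job id)
print(jobs['finished'])  # jobs that have finished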

Cancel a job in a project:

curl http://localhost:6800/cancel.json -d project=dangdang -d job=68d25db0506111e9a4c0e2df1c2eb35b

 

curl http://47.97.169.234:6800/cancel.json -d project=my17k -d job=be6ed036508611e9b68000163e08dec9

 

Notes:
1. When writing the spider, do not forget the **kwargs parameter (see the sketch after these notes);
2. Any extra parameter added to the spider's __init__ must also appear in the curl command.
For example, with __init__(self, ctime=None, **kwargs)
the curl command must include: -d ctime=20150517
so the schedule command above becomes: curl http://localhost:6800/schedule.json -d project=stock_uc -d spider=stockuc -d ctime=20150517
3. To stop a spider, the cancel request has to be run twice: curl http://localhost:6800/cancel.json -d project=stock_uc -d job=862894f6fc8811e4896e001e64619b8c
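A minimal spider sketch illustrating notes 1 and 2 (the names are illustrative, not taken from the original project):

import scrapy

class StockSpider(scrapy.Spider):
    name = 'stockuc'

    def __init__(self, ctime=None, **kwargs):
        # **kwargs is required so that scrapyd's internal arguments
        # (such as _job) do not break __init__.
        super().__init__(**kwargs)
        # ctime arrives from the schedule call: -d ctime=20150517
        self.ctime = ctime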

Configuring scrapyd on CentOS

https://www.cnblogs.com/ss-py/p/9661928.html

After installing, create a configuration file:

  sudo mkdir /etc/scrapyd

  sudo vim /etc/scrapyd/scrapyd.conf

Write the following content (it can be found at https://scrapyd.readthedocs.io/en/stable/config.html):

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

scrapyd-api

Installation

scrapyd-api provides a Python wrapper around scrapyd's HTTP API.
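It is installed from PyPI, where the package is published as python-scrapyd-api (the import name is scrapyd_api, as below):

pip3 install python-scrapyd-api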

from scrapyd_api import ScrapydAPI

scrapyd=ScrapydAPI('http://47.97.169.234:6800')

 

Usage

List all projects:

scrapyd.list_projects()

List the spiders in a project:

scrapyd.list_spiders('my17k')

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://47.97.169.234:6800')
print(scrapyd.list_projects())        # list the project names
print(scrapyd.list_spiders('my17k'))  # list the spider names in that project
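Beyond listing, the wrapper can also start runs. A hedged sketch (the schedule and list_jobs methods are from the python-scrapyd-api wrapper; project and spider names follow the examples above):

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://47.97.169.234:6800')
job_id = scrapyd.schedule('my17k', 'mys1')  # start the spider, returns the job id
print(job_id)
print(scrapyd.list_jobs('my17k'))           # pending / running / finished jobs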

 
