Scrapy crawlers not running simultaneously from Python script

Submitted by 为君一笑 on 2020-01-06 20:14:47

Question


I was just wondering why this might be occurring. Here is my Python script to run them all:

from scrapy import cmdline

file = open('cityNames.txt', 'r')
cityNames = file.read().splitlines()

for city in cityNames:
    url = "http://" + city + ".website.com"
    output = city + ".json"

    cmdline.execute(['scrapy', 'crawl', 'backpage_tester', '-a', 'start_url=' + url, '-o', output])

cityNames.txt:

chicago
sanfran
boston

It runs through the first city fine, but then stops after that. It doesn't run sanfran or boston, only chicago. Any thoughts? Thank you!


Answer 1:


Your script makes synchronous, blocking calls: scrapy.cmdline.execute runs a single crawl and then exits the whole process (it ends in sys.exit), so the loop never reaches the second city. You should either run the crawls asynchronously from Python (see the sketch after the warning below) or use a bash script that iterates over a text file of your urls:

cat urls.txt | xargs -I{} scrapy crawl spider_name -a start_url={}

This should issue one scrapy process per url. Be warned, though: this could easily overload your system if those crawls are extensive and deep on each site and your spiders are not properly configured.
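For the asynchronous-Python route, Scrapy ships its own helper: CrawlerProcess can schedule several spiders on one Twisted reactor and run them concurrently in a single process. Below is a minimal sketch using the spider name and start_url argument from the question; it assumes you run it inside the Scrapy project so the spider can be looked up by name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Use the project's settings so the spider can be resolved by name
process = CrawlerProcess(get_project_settings())

with open('cityNames.txt') as f:
    city_names = f.read().splitlines()

for city in city_names:
    # crawl() only schedules the spider; keyword arguments reach the
    # spider's __init__ exactly like -a arguments on the command line
    process.crawl('backpage_tester', start_url='http://' + city + '.website.com')

process.start()  # starts the reactor and blocks until all scheduled crawls finish

Note that the per-city -o output file has no direct equivalent in this sketch; you would configure feed exports (for example via the FEEDS setting) or write an item pipeline that splits output per city.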



Source: https://stackoverflow.com/questions/33663877/scrapy-crawlers-not-running-simultaneously-from-python-script
