Question
My problem is the following:
To save time, I would like to run several versions of one single spider. The process (the parsing definitions) is the same, the items are the same, and the database collection is the same. What changes is the start_url variable. It looks like this:
"https://www.website.com/details/{0}-{1}-{2}/{3}/meeting".format(year,month,day,type_of_meeting)
Considering the date is the same, for instance 2018-10-24, I would like to launch two versions at the same time:
- version 1 with type_of_meeting = pmu
- version 2 with type_of_meeting = pmh
This is the first part of my problem. Here I wonder whether I must create two different classes in one single spider file, like class SpiderPmu(scrapy.Spider): and class SpiderPmh(scrapy.Spider): in spider.py. But if that is the best way, I don't know how to implement it with respect to settings.py and pipelines.py. I have already read about CrawlerProcess from the scrapy.crawler module (stack subject, scrapy doc) but I don't understand well how to implement it in my project. I am not sure whether the part
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
must go in the spider.py file. Above all, I am not sure it answers my problem.
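For reference, CrawlerProcess is normally driven from a standalone script at the project root rather than from spider.py itself; a minimal sketch, assuming the two spider classes mentioned above live in myproject/spiders/spider.py (the module path is illustrative):
# run_spiders.py -- a standalone script, not part of spider.py
# assumes SpiderPmu and SpiderPmh are defined in myproject/spiders/spider.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.spider import SpiderPmu, SpiderPmh

# get_project_settings() loads settings.py, so pipelines.py still applies
process = CrawlerProcess(get_project_settings())
process.crawl(SpiderPmu)
process.crawl(SpiderPmh)
process.start()  # blocks until both crawls finish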
The second part is how to launch several versions over different date intervals.
I already created some range of intervals in my spider class like:
year = range(2005,2019)
month = range(1,13)
day = range(1,32)
and put it in a loop. That works well.
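For illustration, a minimal sketch of such a loop, using the URL pattern from the question ("pmu" is just one of the two type_of_meeting values mentioned):
# a sketch of the nested loop described above
start_urls = []
for y in range(2005, 2019):
    for m in range(1, 13):
        for d in range(1, 32):  # impossible dates (e.g. 2-30) presumably just fail to match a page
            start_urls.append(
                "https://www.website.com/details/{0}-{1}-{2}/{3}/meeting".format(
                    y, m, d, "pmu"))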
But to save time, I would like to launch several spiders with different intervals of years:
- 1st version with year = range(2005,2007)
- 2nd version with year = range(2007,2009)
- and so on, until year = range(2017,2019)
Seven versions at the same time means seven times faster.
I could create 7 different projects, one for each range of years, but I don't think that is the smartest way... and I am not sure whether 7 different projects running at the same time and using the same database collection would create a conflict.
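As for the database concern: several crawler processes writing to the same MongoDB collection is normally safe, since each process opens its own client connection and document inserts are atomic. A sketch of a typical pipelines.py following the pattern in the Scrapy docs (MONGO_URI, MONGO_DATABASE and the 'meetings' collection name are illustrative, not the asker's actual config):
# pipelines.py -- a common MongoDB pipeline pattern; names are illustrative
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one is atomic per document, so parallel spiders can share a collection
        self.db['meetings'].insert_one(dict(item))
        return item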
I expect to do something like opening 7 command windows:
- scrapy crawl spiderpmu for the version type_of_race = pmu
- "Enter a range of years:" with raw_input = 2010, 2012 ==> range(2010,2012)
- spider is crawling
and in parallel, if necessary:
- scrapy crawl spiderpmh for the version type_of_race = pmh
- "Enter a range of years:" with raw_input = 2010, 2012 ==> range(2010,2012)
- spider is crawling
Possibly using one single spider, or one single project if needed.
How can I do this?
PS: I have already made arrangements with prolipo as a proxy, the Tor network to change IP, and a constantly changing USER_AGENT, so I avoid being banned while crawling with multiple spiders at the same time. And my spider is "polite" with AUTOTHROTTLE_ENABLED = True. I want to keep it polite, but faster.
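For reference, those politeness settings live in settings.py; a minimal sketch, where every value besides AUTOTHROTTLE_ENABLED is an illustrative example rather than the asker's actual configuration:
# settings.py -- a sketch; the values are examples only
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
DOWNLOAD_DELAY = 1.0                   # baseline delay between requests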
Scrapy version: 1.5.0, Python version: 2.7.9, Mongodb version: 3.6.4, Pymongo version: 3.6.1
Answer 1:
Scrapy supports spider arguments. Weirdly enough there's no straightforward documentation, but I'll try to fill in:
When you run a crawl command you may provide -a NAME=VALUE arguments, and these will be set as instance attributes on your spider class. For example:
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'arg'
    # we will set these below when running the crawler
    foo = None
    bar = None

    def start_requests(self):
        # .format() instead of an f-string, since the question uses Python 2.7
        url = 'http://example.com/{0}/{1}'.format(self.foo, self.bar)
        yield Request(url)
And if we run it:
scrapy crawl arg -a foo=1 -a bar=2
# will crawl example.com/1/2
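One caveat: -a values always arrive on the spider as strings, so numeric arguments need explicit conversion before use. A minimal sketch (RangeSpider and its URL are illustrative, not from the answer):
from scrapy import Spider, Request

class RangeSpider(Spider):
    name = 'argrange'
    foo = None
    bar = None

    def start_requests(self):
        # -a foo=1 -a bar=3 arrive as the strings '1' and '3'
        lo, hi = int(self.foo), int(self.bar)
        for n in range(lo, hi):
            yield Request('http://example.com/page/{0}'.format(n))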
Answer 2:
So, I found a solution inspired by scrapy crawl -a variable=value.
The spider concerned, in the "spiders" folder, was transformed:
import scrapy

class MySpider(scrapy.Spider):
    name = "arg"
    allowed_domains = ['www.website.com']

    def __init__(self, lo_lim=None, up_lim=None, type_of_race=None):
        # e.g. lo_lim = 2017, up_lim = 2019, type_of_race = pmu
        year = range(int(lo_lim), int(up_lim))  # arguments arrive as strings, so convert to int
        month = range(1, 13)  # 12 months
        day = range(1, 32)    # 31 days
        url = []
        for y in year:
            for m in month:
                for d in day:
                    url.append("https://www.website.com/details/{}-{}-{}/{}/meeting".format(y, m, d, type_of_race))
        self.start_urls = url
        # where url = ["https://www.website.com/details/2017-1-1/pmu/meeting",
        #              "https://www.website.com/details/2017-1-2/pmu/meeting",
        #              ...
        #              "https://www.website.com/details/2017-12-31/pmu/meeting",
        #              "https://www.website.com/details/2018-1-1/pmu/meeting",
        #              "https://www.website.com/details/2018-1-2/pmu/meeting",
        #              ...
        #              "https://www.website.com/details/2018-12-31/pmu/meeting"]

    def parse(self, response):
        ...
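With that __init__ in place, the seven parallel versions from the question come down to opening seven terminals and running, for example:
scrapy crawl arg -a lo_lim=2005 -a up_lim=2007 -a type_of_race=pmu
scrapy crawl arg -a lo_lim=2007 -a up_lim=2009 -a type_of_race=pmu
# ... one terminal per year range, up to:
scrapy crawl arg -a lo_lim=2017 -a up_lim=2019 -a type_of_race=pmu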
Then it answers my problem: keeping one single spider, and running several versions of it with several commands at the same time, without trouble.
Without a def __init__ it didn't work for me. I tried a lot of ways; this is the perfectible code that works for me.
Scrapy version: 1.5.0, Python version: 2.7.9, Mongodb version: 3.6.4, Pymongo version: 3.6.1
Source: https://stackoverflow.com/questions/52977185/how-to-run-several-versions-of-one-single-spider-at-one-time-with-scrapy