How to run several versions of a single spider at the same time with Scrapy?


Question


My problem is the following:

To save time, I would like to run several versions of one single spider. The process (the parsing definitions) is the same, the items are the same, and the database collection is the same. What changes is the start_url variable. It looks like this:

"https://www.website.com/details/{0}-{1}-{2}/{3}/meeting".format(year,month,day,type_of_meeting)

Considering the date is the same, for instance 2018-10-24, I would like to launch two versions at the same time:

  • version 1 with type_of_meeting = pmu
  • version 2 with type_of_meeting = pmh

This is the first part of my problem. Here I wonder if I must create two different classes in one single spider file, like class SpiderPmu(scrapy.Spider): and class SpiderPmh(scrapy.Spider): in spider.py. But if you think that is the best way, I don't know how to implement it considering settings.py and pipelines.py. I have already read about CrawlerProcess from the scrapy.crawler module (stack subject, scrapy doc), but I don't understand well how to implement it in my project. I am not sure the part

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()

must be in the spider.py file. Above all, I am not sure it answers my problem.
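
From what I read, that would look something like this, in its own script rather than in spider.py (a sketch only; MySpider, type_of_meeting and the import path are placeholder names, not code from my project):

# run_spiders.py -- a sketch of the CrawlerProcess idea from the doc
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.spider import MySpider  # placeholder import path

process = CrawlerProcess(get_project_settings())  # picks up settings.py
process.crawl(MySpider, type_of_meeting='pmu')    # first version
process.crawl(MySpider, type_of_meeting='pmh')    # second version
process.start()  # starts the reactor and blocks until both crawls finish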

The second part is how to launch several versions with different date intervals.

I already created some ranges in my spider class, like:

  • year = range(2005,2019)
  • month = range(1,13)
  • day = range(1,32)

and put them in a loop. That works well.

But to save time, I would like to launch several spiders with different year intervals:

  • 1st version with year = range(2005,2007)
  • 2nd version with year = range(2007,2009)
  • and so on, until year = range(2017,2019)

Seven versions at the same time means seven times faster.

I could create 7 different projects, one for each range of years, but I think this is not the smartest way... and I am not sure whether using the same database collection for 7 different projects running at the same time will create a conflict.

I expect to do something like opening 7 command prompts:

  1. scrapy crawl spiderpmu for the version type_of_race = pmu
  2. "Enter a range of year": with raw_input = 2010, 2012 ==> range(2010,2012)
  3. Spider is crawling

and, in parallel if necessary, to do:

  1. scrapy crawl spiderpmh for the version type_of_race = pmh
  2. "Enter a range of year": with raw_input = 2010, 2012 ==> range(2010,2012)
  3. Spider is crawling

Possibly using one single spider, or one single project if needed; a sketch of that prompt idea follows.
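
For the prompt part, I picture something like this inside the spider (a sketch only; Python 2, hence raw_input):

# sketch of the "Enter a range of years" prompt (Python 2, hence raw_input)
answer = raw_input("Enter a range of years: ")        # e.g. "2010, 2012"
lo_lim, up_lim = [int(x) for x in answer.split(",")]  # -> 2010, 2012
years = range(lo_lim, up_lim)                         # ==> range(2010, 2012)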

How can I do this?

PS: I have already made arrangements with prolipo as a proxy, the Tor network to change IPs, and a constantly rotating USER_AGENT, so I avoid being banned while crawling with multiple spiders at the same time. And my spider is "polite" with AUTOTHROTTLE_ENABLED = True. I want to keep it polite, but faster.

Scrapy version: 1.5.0, Python version: 2.7.9, Mongodb version: 3.6.4, Pymongo version: 3.6.1


Answer 1:


Scrapy supports spider arguments. Weirdly enough, there's no straightforward documentation for them, but I'll try to fill in:

When you run a crawl command you may provide -a NAME=VALUE arguments and these will be set as your spider class instance variables. For example:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'arg'
    # we will set these below when running the crawler
    foo = None
    bar = None

    def start_requests(self):
        # note: f-strings require Python 3.6+; on Python 2 use .format()
        url = f'http://example.com/{self.foo}/{self.bar}'
        yield Request(url)

And if we run it:

scrapy crawl arg -a foo=1 -a bar=2
# will crawl example.com/1/2
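
Note that arguments passed with -a always arrive as strings; if you need numbers, convert them yourself (e.g. int(self.foo)).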



Answer 2:


So, I found a solution inspired by scrapy crawl -a variable=value.

The spider concerned, in the "spiders" folder, was transformed as follows:

import scrapy


class MySpider(scrapy.Spider):
    name = "arg"
    allowed_domains = ['www.website.com']

    def __init__(self, lo_lim=None, up_lim=None, type_of_race=None, *args, **kwargs):
        # e.g. lo_lim = 2017, up_lim = 2019, type_of_race = pmu
        super(MySpider, self).__init__(*args, **kwargs)
        year = range(int(lo_lim), int(up_lim))  # arguments arrive as strings, so convert them to integers
        month = range(1, 13)  # 12 months
        day = range(1, 32)    # 31 days
        urls = []
        for y in year:
            for m in month:
                for d in day:
                    urls.append("https://www.website.com/details/{}-{}-{}/{}/meeting".format(y, m, d, type_of_race))

        self.start_urls = urls  # e.g. ["https://www.website.com/details/2017-1-1/pmu/meeting",
                                #       "https://www.website.com/details/2017-1-2/pmu/meeting",
                                #       ...
                                #       "https://www.website.com/details/2017-12-31/pmu/meeting",
                                #       "https://www.website.com/details/2018-1-1/pmu/meeting",
                                #       ...
                                #       "https://www.website.com/details/2018-12-31/pmu/meeting"]

    def parse(self, response):
        ...

Then it answers my problem: I keep one single spider, and run several versions of it with several commands at the same time, without trouble.
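
For example, launching two versions from two separate terminals looks like this:

scrapy crawl arg -a lo_lim=2010 -a up_lim=2012 -a type_of_race=pmu
scrapy crawl arg -a lo_lim=2010 -a up_lim=2012 -a type_of_race=pmh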

Without a def __init__ it didn't work for me. I tried a lot of approaches, and this is the perfectible code that works for me.




Source: https://stackoverflow.com/questions/52977185/how-to-run-several-versions-of-one-single-spider-at-one-time-with-scrapy
