How can I make Selenium run in parallel with Scrapy?


Question


I'm trying to scrape some URLs with Scrapy and Selenium. Some of the URLs are processed by Scrapy directly, and the others are handled with Selenium first.

The problem is: while Selenium is handling a URL, Scrapy does not process the others in parallel. It waits for the webdriver to finish its work.

I have tried running multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried spawning another process in the parse method, but it seems I don't have enough experience to get it right.

In the example below, all the URLs are printed only after the webdriver has closed. Please advise: is there any way to make this run "in parallel"?

import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    """Render the page in Firefox and return its HTML source."""
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'

    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)

Answer 1:


It seems that I've found a solution.

I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.

I had tried this way before, but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the CrawlerProcess's start() method more than once in the same process, which is not allowed because the Twisted reactor cannot be restarted. I found a simple and clear example of running a spider in a loop using callbacks.
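For reference, this is roughly the pattern that fails (a minimal sketch, assuming the tasks list and the test_spider from the question post):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

for task in tasks:
    process = CrawlerProcess(get_project_settings())
    process.crawl('test_spider', task=task)
    process.start()  # the second call raises ReactorNotRestartable: the Twisted reactor cannot be restarted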

So I split my tasks list between the processes. Then, inside the crawl(tasks) function, I build a chain of deferred callbacks to run my spider multiple times, passing a different task as its init parameter each time.

import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        # Chain the crawls: each one starts only after the previous deferred fires,
        # so the reactor in this process is started exactly once.
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()


def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        # Each worker process gets its own chunk of tasks and its own reactor.
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()

The code of TestSpider in my question post must be modified accordingly to accept a task as an init parameter.

def __init__(self, task):
    scrapy.Spider.__init__(self)
    self.task = task

def start_requests(self):
    yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)
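Putting it together, the full spider would look roughly like this (a minimal sketch; it assumes the load_with_selenium helper from the question post is defined in the same module):

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test_spider'

    def __init__(self, task):
        # A single {'start_url': ..., 'selenium': ...} dict passed via process.crawl()
        scrapy.Spider.__init__(self)
        self.task = task

    def start_requests(self):
        yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)

    def parse(self, response):
        if response.meta['selenium']:
            # Re-render the page with Selenium; load_with_selenium is the helper from the question
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)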


Source: https://stackoverflow.com/questions/61194207/how-can-i-make-selenium-run-in-parallel-with-scrapy
