Get scrapy result inside a Django view

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-23 02:49:17

Question


I'm successfully scraping a page that returns a single item. I don't want to save the scraped item to the database or to a file. I need to get it inside a Django view.

My view is as follows:

from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
from scrapy.utils.project import get_project_settings
# MySpider is imported from the project's spiders module


def start_crawl(process_number, court):
    """
    Starts the crawler.

        Args:
            process_number (str): Process number to be found.
            court (str): Court of the process.
    """
    runner = CrawlerRunner(get_project_settings())
    results = list()

    def crawler_results(sender, parse_result, **kwargs):
        results.append(parse_result)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    process_info = runner.crawl(MySpider, process_number=process_number, court=court)

    return results

I followed this solution, but the results list is always empty.

I read something about creating a custom middleware and getting the results in its process_spider_output method.
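Roughly, I imagine that middleware would look like the sketch below (hypothetical, I have not used it; ItemCollectorMiddleware and collected_items are placeholder names):

from scrapy import Request


class ItemCollectorMiddleware(object):
    """Spider middleware idea: copy every scraped item onto the spider itself."""

    def process_spider_output(self, response, result, spider):
        # result contains both Requests and items; keep the items, pass everything on.
        for element in result:
            if not isinstance(element, Request):
                if not hasattr(spider, "collected_items"):
                    spider.collected_items = []
                spider.collected_items.append(element)
            yield element

# (It would have to be enabled under SPIDER_MIDDLEWARES in settings.py.)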

How can I get the desired result?

Thanks!


Answer 1:


I managed to implement something like that in one of my projects. It is a mini-project and I was looking for a quick solution. You might need to modify it or add multi-threading support, etc., if you put it into a production environment.

Overview

I created an ItemPipeline that just adds the items to an InMemoryItemStore helper. Then, in my __main__ code, I wait for the crawler to finish and pop all the items out of the InMemoryItemStore. From there I can manipulate the items as I wish.

Code

items_store.py

Hacky in-memory store. It is not very elegant, but it got the job done for me. Modify and improve it if you wish. I've implemented it as a simple class with classmethods so I can import it anywhere in the project and use it without passing an instance around.

class InMemoryItemStore(object):
    __ITEM_STORE = None

    @classmethod
    def pop_items(cls):
        items = cls.__ITEM_STORE or []
        cls.__ITEM_STORE = None
        return items

    @classmethod
    def add_item(cls, item):
        if not cls.__ITEM_STORE:
            cls.__ITEM_STORE = []
        cls.__ITEM_STORE.append(item)
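
A tiny usage sketch, just to show the class-level interface (the dict items here are made up for illustration):

from <your-project>.items_store import InMemoryItemStore

InMemoryItemStore.add_item({"title": "first"})
InMemoryItemStore.add_item({"title": "second"})
print(InMemoryItemStore.pop_items())  # [{'title': 'first'}, {'title': 'second'}]
print(InMemoryItemStore.pop_items())  # [] -- popping also empties the store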

pipelines.py

This pipeline stores the objects in the in-memory store from the snippet above. All items are simply returned to keep the regular pipeline flow intact. If you don't want to pass some items down to the other pipelines, simply change process_item so it does not return those items.

from <your-project>.items_store import InMemoryItemStore


class StoreInMemoryPipeline(object):
    """Add items to the in-memory item store."""
    def process_item(self, item, spider):
        InMemoryItemStore.add_item(item)
        return item

settings.py

Now add StoreInMemoryPipeline to the scraper settings. If you change the process_item method above, make sure you set the proper priority here (by changing the 100 below).

ITEM_PIPELINES = {
   ...
   '<your-project-name>.pipelines.StoreInMemoryPipeline': 100,
   ...
}

main.py

This is where I tie all these things together. I clear the in-memory store, run the crawler, and fetch all the items.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from <your-project>.items_store import InMemoryItemStore
from <your-project>.spiders.your_spider import YourSpider

def get_crawler_items(**kwargs):
    InMemoryItemStore.pop_items()  # clear any leftover items from a previous run

    process = CrawlerProcess(get_project_settings())
    process.crawl(YourSpider, **kwargs)
    process.start()  # the script will block here until the crawling is finished
    process.stop()
    return InMemoryItemStore.pop_items()

if __name__ == "__main__":
    items = get_crawler_items()
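
Since the question is about a Django view, here is a hypothetical way the helper above could be wired in (untested; the view signature and the JsonResponse payload are my assumptions, and note that CrawlerProcess starts the Twisted reactor, which can only run once per Python process, so this only suits one-off runs or a separate worker process):

# views.py (hypothetical)
from django.http import JsonResponse

from <your-project>.main import get_crawler_items


def start_crawl(request, process_number, court):
    items = get_crawler_items(process_number=process_number, court=court)
    return JsonResponse({"items": [dict(item) for item in items]})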



Answer 2:


If you really want to collect all the data in a "special" object, store the data in a separate pipeline, similar to the duplicates filter example (https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter), and in close_spider (https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=close_spider#close_spider) create or update your Django object.
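
A minimal sketch of that idea, assuming the pipeline keeps its own list and the Django-side handling happens in close_spider (the pipeline name and the model call are placeholders, not from the docs):

class CollectAndPersistPipeline(object):
    """Collect items during the crawl and hand them over when the spider closes."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # All scraped items are available here; this is where you would
        # create or update your Django objects, e.g.
        # MyDjangoModel.objects.create(**item)  # hypothetical model
        pass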



Source: https://stackoverflow.com/questions/52141729/get-scrapy-result-inside-a-django-view
