How to get scraped items from the main script using Scrapy?


Question


I would like to get the list of scraped items in my main script, instead of using the scrapy shell.

I know that the spider class I define (say FooSpider) has a parse method, that this method returns a list of Item objects, and that the Scrapy framework calls it for me. But how can I get hold of that returned list myself?

I have found many posts about this, but I don't understand what they are saying.

For context, here is the official example code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        result = []
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            result.append(item)

        return result

How can I get the returned result from a main Python script such as main.py or run.py?

if __name__ == "__main__":
    ...
    result = xxxx()
    for item in result:
        print(item)

Could anyone provide a code snippet that retrieves this returned list?

Thank you very much!


Answer 1:


This is an example of how you can collect all the items in a list with a pipeline:

#!/usr/bin/python3

# Scrapy API imports
import scrapy
from scrapy.crawler import CrawlerProcess

# your spider
from FollowAllSpider import FollowAllSpider

# list to collect all items
items = []

# pipeline to fill the items list
class ItemCollectorPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        items.append(item)

# create a crawler process with the specified settings
process = CrawlerProcess({
    'USER_AGENT': 'scrapy',
    'LOG_LEVEL': 'INFO',
    'ITEM_PIPELINES': { '__main__.ItemCollectorPipeline': 100 }
})

# start the spider
process.crawl(FollowAllSpider)
process.start()

# print the items
for item in items:
    print("url: " + item['url'])

You can get FollowAllSpider from here, or use your own spider. Example output when using it with my webpage:

$ ./crawler.py 
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Versions: lxml 3.7.1.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.5.3 (default, Jan 19 2017, 14:11:04) - [GCC 6.3.0 20170118], pyOpenSSL 16.2.0 (OpenSSL 1.1.0f  25 May 2017), cryptography 1.7.1, Platform Linux-4.9.0-6-amd64-x86_64-with-debian-9.5
2018-09-16 22:28:09 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'scrapy', 'LOG_LEVEL': 'INFO'}
[...]
2018-09-16 22:28:15 [scrapy.core.engine] INFO: Spider closed (finished)
url: http://www.frank-buss.de/
url: http://www.frank-buss.de/impressum.html
url: http://www.frank-buss.de/spline.html
url: http://www.frank-buss.de/schnecke/index.html
url: http://www.frank-buss.de/solitaire/index.html
url: http://www.frank-buss.de/forth/index.html
url: http://www.frank-buss.de/pi.tex
[...]
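
If FollowAllSpider is not available to you, a minimal stand-in spider works just as well with the pipeline above. The sketch below is only an assumption of what such a spider could look like; the domain, the depth limit, and the url field are placeholders chosen so that the printing loop finds item['url']:

import scrapy

class MinimalSpider(scrapy.Spider):
    """Follows in-domain links and yields one item per visited page."""
    name = "minimal"
    allowed_domains = ["frank-buss.de"]          # placeholder domain from the output above
    start_urls = ["http://www.frank-buss.de/"]   # placeholder start page
    custom_settings = {"DEPTH_LIMIT": 1}         # keep the example crawl small

    def parse(self, response):
        # one item per visited page; ItemCollectorPipeline appends it to `items`
        yield {"url": response.url}
        # follow the links on the page and parse them with the same callback
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)

Passing MinimalSpider to process.crawl() instead of FollowAllSpider produces the same kind of "url: ..." output.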



Answer 2:


If what you want is to work with, process, transform, or store the items, you should look into the Item Pipeline, and the usual scrapy crawl command will do the trick.
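
For example, a pipeline that writes every item to a JSON-lines file could look roughly like the sketch below (modeled on the JsonWriterPipeline from the Scrapy documentation; the file name and pipeline priority are arbitrary):

import json

class JsonWriterPipeline(object):
    """Example pipeline: append each scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for Item objects as well as plain dicts
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Enable it in settings.py with ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300} and run scrapy crawl dmoz as usual. If you only need the items in a file, the built-in feed exports (scrapy crawl dmoz -o items.json) do the same without any custom code.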



Source: https://stackoverflow.com/questions/38187891/how-to-get-scraped-items-from-main-script-using-scrapy
