Scrapy: how to use items in spider and how to send items to pipelines?

Submitted by 大憨熊 on 2019-12-02 17:41:43
Adrien Blanquer
  • How to use items in my spider?

The main purpose of items is to store the data you crawled. A scrapy.Item behaves much like a dictionary. To declare an item, create a class that subclasses scrapy.Item and add one scrapy.Field attribute for each piece of data you want to store:

import scrapy

class Product(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

You can now use it in your spider by importing the Product class.

For more advanced usage, see the Items page of the Scrapy documentation.

  • How to send items to the pipeline?

First, you need to tell Scrapy to use your custom pipeline.

In the settings.py file:

ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}
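The number attached to each pipeline sets its position in the chain: Scrapy passes items through pipelines from the lowest value to the highest, and values are conventionally picked in the 0-1000 range. The sketch below only illustrates that ordering in plain Python; it is not Scrapy's internal code, and ValidationPipeline is a hypothetical second pipeline.

```python
# Hypothetical settings with two pipelines; ValidationPipeline is an invented name
ITEM_PIPELINES = {
    "myproject.pipelines.CustomPipeline": 300,
    "myproject.pipelines.ValidationPipeline": 100,
}

# Lower values run first, so sorting by value gives the order items travel in
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Here an item would reach ValidationPipeline before CustomPipeline, even though CustomPipeline is listed first in the dictionary.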

You can now write your pipeline and work with your item.

In the pipelines.py file:

from scrapy.exceptions import DropItem

class CustomPipeline:
    def __init__(self):
        # Create your database connection here
        pass

    def process_item(self, item, spider):
        # Here you can index your item, or raise DropItem to discard it
        return item
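To make the "create your database connection" step concrete, here is one possible sketch that stores items in SQLite. The class name SQLitePipeline, the db_path parameter, and the products table are all assumptions made for illustration. open_spider and close_spider are optional pipeline methods that Scrapy calls when the spider starts and finishes. A plain dict stands in for a Product item so the snippet runs without a crawl.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each item's url and title in SQLite."""

    def __init__(self, db_path=":memory:"):
        # db_path is an assumed parameter; ":memory:" keeps the sketch self-contained
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the connection and create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called when the spider finishes: persist changes and release the connection
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert the item, replacing any earlier row with the same url
        self.conn.execute(
            "INSERT OR REPLACE INTO products (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        # Return the item so that later pipelines still receive it
        return item


# Exercising the pipeline by hand; no spider object is needed for this sketch
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
returned = pipeline.process_item(
    {"url": "http://www.exemple.com", "title": "Example"}, spider=None
)
rows = pipeline.conn.execute("SELECT url, title FROM products").fetchall()
pipeline.close_spider(spider=None)
```

In a real project the last block would be dropped: Scrapy instantiates the class and calls these methods itself once the pipeline is listed in ITEM_PIPELINES.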

Finally, in your spider, you need to yield your item once it is filled.

spider.py example:

import scrapy
# Same project package as the one referenced in ITEM_PIPELINES
from myproject.items import Product

class MySpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.exemple.com',
    ]

    def parse(self, response):
        doc = Product()
        doc['url'] = response.url
        # .get() returns the first matching string instead of a selector list
        doc['title'] = response.xpath('//div/p/text()').get()
        yield doc  # Will go to your pipeline

Hope this helps. For more on pipelines, see the Item Pipeline page of the Scrapy documentation.
