How to get a single item across many sites in scrapy?

Submitted by 不羁的心 on 2020-01-03 05:26:13

Question


I have this situation:

I want to crawl product details from a specific product detail page that describes the product (Page A). This page contains a link to a page that lists the sellers of this product (Page B), and each seller entry links to another page (Page C) that contains the seller's details. Here is an example schema:

Page A:

  • product_name
  • link to sellers of this product (Page B)

Page B:

  • list of sellers, each one containing:
    • seller_name
    • selling_price
    • link to the seller details page (Page C)

Page C:

  • seller_address

This is the JSON I want to obtain after crawling:

{
  "product_name": "product1",
  "sellers": [
    {
      "seller_name": "seller1",
      "seller_price": 100,
      "seller_address": "address1",
    },
    (...)
  ]
}

What I have tried: passing the product information from the parse method to the second parse method via the meta object. This works fine across 2 levels, but I have 3, and I want a single item.

Is this possible in Scrapy?

EDIT:

As requested, here is a minimal example of what I am trying to do. I know it won't work as expected, but I cannot figure out how to make it return only 1 composed object:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]

    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):

        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'

        seller['seller_address'] = seller_address

        yield seller

Answer 1:


You need to change your logic a bit, so that it queries only one seller address at a time, and once that completes it queries the next seller.

def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        }
    ]

    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # no sellers at all, so the product is already complete
        yield meta['product']


    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']
    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address

    meta['product']['sellers'].append(current_seller)
    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller

        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']

But this is still not a great approach, because a seller may be selling multiple products. When you reach another product sold by the same seller, your request for the seller's address page will be rejected by the dupe filter. You can fix that by adding dont_filter=True to the request, but that means many unnecessary hits to the website.
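
For illustration, this is where the flag would go in the request above (only dont_filter=True is new compared to the code shown earlier):

yield scrapy.Request(
    current_seller['seller_detail_url'],
    callback=self.parse_seller,
    meta=meta,
    dont_filter=True,  # bypass Scrapy's duplicate-request filter
)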

So you need to add DB (or cache) handling directly in the code to check whether you already have a seller's details: if yes, reuse them; if not, fetch them.
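
As a rough sketch of that idea (an in-memory dict standing in for the database; the cache attribute and helper method names here are made up, not part of the answer):

import scrapy


class CachingExampleSpider(scrapy.Spider):
    """Hypothetical variant of the spider that remembers seller addresses."""
    name = 'cachingexamplespider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # stand-in for a real DB: seller_detail_url -> seller_address
        self.seller_cache = {}

    def seller_request_or_none(self, current_seller, meta):
        """Return a Request for the seller page, or None if the address is cached."""
        url = current_seller['seller_detail_url']
        if url in self.seller_cache:
            # known seller: reuse the stored address, no extra hit to the site
            current_seller['seller_address'] = self.seller_cache[url]
            meta['product']['sellers'].append(current_seller)
            return None
        return scrapy.Request(url, callback=self.parse_seller,
                              meta=meta, dont_filter=True)

    def parse_seller(self, response):
        current_seller = response.meta['current_seller']
        seller_address = 'seller_address1'  # assume xpath processing
        # remember the address so later products by this seller skip the request
        self.seller_cache[response.url] = seller_address
        current_seller['seller_address'] = seller_address
        # ... then continue with the pending_sellers chaining shown above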




Answer 2:


I think a pipeline could help.

Assuming each yielded seller item is in the following format (which can be achieved with a trivial modification of the code, sketched after the snippet):

seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
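
For instance, the spider's parse_seller could be rewritten roughly like this (just a sketch; it assumes the product name is also passed along in meta, which the question's spider does not do yet):

def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this value was obtained after some xpath processing
    seller_address = 'seller_address1'

    # emit one flat item per seller; the pipeline below groups them by product_name
    yield {
        'product_name': response.meta['product_name'],
        'seller': {
            'seller_name': seller['seller_name'],
            'seller_price': seller['seller_price'],
            'seller_address': seller_address,
        },
    }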

A pipeline like the following will collect sellers by their product_name and export them to a file named 'items.jl' after crawling (note that this is just a sketch of the idea, so it is not guaranteed to work):

import json


class CollectorPipeline(object):

    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        # store the updated product back so close_spider can export it
        product["sellers"] = sellers
        self.collection[item["product_name"]] = product

        return item

BTW, you need to modify your settings.py to enable the pipeline, as described in the Scrapy documentation.
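
For example, assuming the pipeline class lives in myproject/pipelines.py (the module path here is hypothetical), the setting would look like:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,
}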



Source: https://stackoverflow.com/questions/46413023/how-to-get-a-single-item-across-many-sites-in-scrapy
