One spider with 2 different URLs and 2 parse methods using Scrapy

Submitted by 不羁岁月 on 2019-12-25 00:14:36

Question


Hi, I have 2 different domains with 2 different approaches running in one spider. I have tried this code, but nothing works. Any ideas, please?

import json

import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']
    url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts',
           'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']

    # JSON payload code here

    def start_requests(self):
        for i in self.url:
            if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
                print("sample: " + i)
                payload = self.payload.copy()
                payload['page']['pageNo'] = 1
                yield scrapy.Request(
                    i, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse_2, meta={'pageNo': 1})

            if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
                yield scrapy.Request(i, callback=self.parse_1)

    def parse_1(self, response):
        # Some code for getting the item
        pass

    def parse_2(self, response):
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = GpdealsSpiderItem_f21()  # the project's item class
            # item fields filled here
            yield item

        # Simulate pagination if we are not at the end
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            payload = self.payload.copy()
            payload['page']['pageNo'] = response.meta['pageNo'] + 1
            yield scrapy.Request(
                self.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']})

I always get this error:

NameError: name 'url' is not defined


Answer 1:


You have two different spiders in the same class. For the sake of maintainability, I recommend keeping them in separate files.

If you really want to keep them together, it would be easier to split the URLs into two lists:

type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', ]
type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20', ]

def start_requests(self):
    for url in self.type1_urls:
        payload = self.payload.copy()
        yield scrapy.Request(
            # ...
            callback=self.parse_1
        )

    for url in self.type2_urls:
        yield scrapy.Request(url, callback=self.parse_2)
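
For completeness, here is a minimal runnable sketch of this split approach with the POST request filled in from the question's own code. The payload dict is an assumption: its real contents are elided in the question, so only the 'page' keys the question's code touches are included, and the pageSize value is a placeholder. Note that this answer maps the POST endpoint to parse_1, which is the reverse of the question's naming.

import json

import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']

    # URLs that expect a JSON POST (the Forever21 catalog endpoint)
    type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts']
    # URLs fetched with a plain GET (the H&M sale listing)
    type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']

    # Assumed shape only: the real payload is elided in the question;
    # these are the keys the question's code reads and writes.
    payload = {'page': {'pageNo': 1, 'pageSize': 60}}

    def start_requests(self):
        for url in self.type1_urls:
            payload = self.payload.copy()
            payload['page']['pageNo'] = 1
            yield scrapy.Request(
                url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_1, meta={'pageNo': 1})

        for url in self.type2_urls:
            yield scrapy.Request(url, callback=self.parse_2)

    def parse_1(self, response):
        # JSON response from the POST endpoint
        data = json.loads(response.text)
        self.logger.info('got %d products', len(data.get('CatalogProducts', [])))

    def parse_2(self, response):
        # plain HTML listing page
        self.logger.info('parsed %s', response.url)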



Answer 2:


You should iterate over self.url in your for loop and then work with the i variable inside the loop for comparisons, request yielding, etc.:

def start_requests(self):
    for i in self.url:
        if (i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts'):
            payload = self.payload.copy()
            payload['page']['pageNo'] = 1
            yield scrapy.Request(
                i, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                     'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': 1})

        if (i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20'):
            yield scrapy.Request(i, callback=self.parse_1)
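
A side note on the question's parse_2 that neither answer raises: the pagination request there passes self.url, which is the whole list, as the request URL, while scrapy.Request expects a single string. Reusing response.url, the endpoint the current page was fetched from, is one minimal fix:

    # Pagination inside parse_2: re-request the same endpoint for the next page.
    # response.url is the URL the current response came from, so the POST goes
    # back to the Forever21 catalog endpoint with the next page number.
    if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
        payload = self.payload.copy()
        payload['page']['pageNo'] = response.meta['pageNo'] + 1
        yield scrapy.Request(
            response.url, method='POST', body=json.dumps(payload),
            headers={'X-Requested-With': 'XMLHttpRequest',
                     'Content-Type': 'application/json; charset=UTF-8'},
            callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']})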


Source: https://stackoverflow.com/questions/55761521/one-spider-with-2-different-url-and-2-parse-using-scrapy
