Getting “Crawled 324 pages (at 133 pages/min), scraped 304 items (at 130 items/min)” after paginating 18 pages while there are 42 pages to scrap?

女生的网名这么多〃 提交于 2020-04-30 10:43:52

问题


While I have written a script to scrape data from the site and it is working ideally but after scraping about 18 pages' data(as there are about 42 pages), the scrapy get stuck by giving a log info after and after.

I visited the similar questions answered on stackoverflow but in all of them the scripts were not working from the beginning while in my case the script scraped data from about 18 pages and then get stuck.

Here is the script

# -*- coding: utf-8 -*-
import scrapy
import logging

class KhaadiSpider(scrapy.Spider):
    name = 'khaadi'

    start_urls = ['https://www.khaadi.com/pk/woman.html/']

    def parse(self, response):
        urls= response.xpath('//ol/li/div/a/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.product_page)
        next_page=response.xpath('//*[@class="action  next"]/@href').extract_first()
        while(next_page!=None):
            yield scrapy.Request(next_page)
        logging.info("Scraped all the pages Successfuly....")

    def product_page(self,response):
        image= response.xpath('//*[@class="MagicZoom"]/@href').extract_first()
        page_title= response.xpath('//*[@class="page-title"]/span/text()').extract_first()
        price=response.xpath('//*[@class="price"]/text()').extract_first()
        page_url=response.url

        yield {'Image':image,
               "Page Title":page_title,
               "Price":price,
               "Page Url":page_url

        }

This is the Logger info

2019-10-05 11:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.khaadi.com/pk/ksffs19301-blue.html> (referer: https://www.khaadi.com/pk/woman.html?p=18)
2019-10-05 11:22:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/b19428-pink-3pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/b/1/b19428b.jpg', 'Page Title': u'Shirt Shalwar Dupatta', 'Page Url': 'https://www.khaadi.com/pk/b19428-pink-3pc.html', 'Price': u'PKR2,170'}
2019-10-05 11:22:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/i19417-blue-2pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/i/1/i19417b.jpg', 'Page Title': u'Shirt Shalwar', 'Page Url': 'https://www.khaadi.com/pk/i19417-blue-2pc.html', 'Price': u'PKR1,680'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wshe19498-off-white.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/s/wshe19498_2_.jpg', 'Page Title': u'Embroidered Shalwar', 'Page Url': 'https://www.khaadi.com/pk/wshe19498-off-white.html', 'Price': u'PKR1,800'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wet19401-off-white.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/e/wet19401_offwhite__1_.jpg', 'Page Title': u'EMBELLISHED TIGHTS', 'Page Url': 'https://www.khaadi.com/pk/wet19401-off-white.html', 'Price': u'PKR1,000'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wet19402-black.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/e/wet19402_black__2_.jpg', 'Page Title': u'EMBELLISHED TIGHTS', 'Page Url': 'https://www.khaadi.com/pk/wet19402-black.html', 'Price': u'PKR1,000'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/k19407-yellow-3pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/k/1/k19407b.jpg', 'Page Title': u'Shirt Shalwar Dupatta', 'Page Url': 'https://www.khaadi.com/pk/k19407-yellow-3pc.html', 'Price': u'PKR2,940'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/k19408-blue-3pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/k/1/k19408a.jpg', 'Page Title': u'Shirt Shalwar Dupatta', 'Page Url': 'https://www.khaadi.com/pk/k19408-blue-3pc.html', 'Price': u'PKR2,940'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wet19408-pink.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/e/wet19408_pink__1_.jpg', 'Page Title': u'EMBELLISHED TIGHTS', 'Page Url': 'https://www.khaadi.com/pk/wet19408-pink.html', 'Price': u'PKR1,000'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wbme19474-off-white.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/b/wbme19474_offwhite__1_.jpg', 'Page Title': u'Embroidered Metallica Pants', 'Page Url': 'https://www.khaadi.com/pk/wbme19474-off-white.html', 'Price': u'PKR2,400'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/ksffs19301-blue.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/k/s/ksffs19301_blue__2_.jpg', 'Page Title': u'Semi Formal Full Suit', 'Page Url': 'https://www.khaadi.com/pk/ksffs19301-blue.html', 'Price': u'PKR18,000'}
2019-10-05 11:22:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 326 pages/min), scraped 307 items (at 307 items/min)
2019-10-05 11:23:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:24:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:25:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:26:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:27:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:28:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:29:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:30:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:31:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:32:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:33:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:34:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:35:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:36:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)

All the other files left default.


回答1:


Your page advance logic is correct, but it appears the server you're scraping may have some anti-scraping defense mechanisms in place.

When running your code as-is I got similar results, scraping basically stops after a while. I suspect the server detects it's being scraped and either slows down or completely stops responding to the scraping requests.

Just for test purposes I tweaked the code a bit to not hammer the server as bad, hoping to remain below the scraping detection radar:

    for url in urls:
        yield scrapy.Request(url, callback=self.product_page)
        break  # only scrape one product per page
    next_page = response.xpath('//*[@class="action  next"]/@href').extract_first()
    while (next_page != None):
        time.sleep(2)  # slow down scraping rate
        if next_page.endswith('p=2'):
            # jump to page 18, skipping what is known to work fine
            next_page = re.sub('p=2', 'p=18', next_page)
        yield scrapy.Request(next_page)

With these changes in place I can see the scraping (slowly) reaching past the page where it was stopping earlier and still going:

2019-10-05 12:57:28 [scrapy.extensions.logstats] INFO: Crawled 32 pages (at 0 pages/min), scraped 15 items (at 0 items/min)
2019-10-05 12:57:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.khaadi.com/pk/j19405-off-white-2pc.html> (referer: https://www.khaadi.com/pk/woman.html?p=33)
2019-10-05 12:58:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.khaadi.com/pk/woman.html?p=34> (referer: https://www.khaadi.com/pk/woman.html?p=33)
2019-10-05 12:58:28 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 2 pages/min), scraped 15 items (at 0 items/min)
2019-10-05 12:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/j19405-off-white-2pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/j/1/j19405a.jpg', 'Page Title': u'Shirt Shalwar', 'Page Url': 'https://www.khaadi.com/pk/j19405-off-white-2pc.html', 'Price': u'PKR1,190'}
2019-10-05 12:59:28 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 0 pages/min), scraped 16 items (at 1 items/min)

Even with these tweaks scraping was detected, so I further tweaked it to skip more pages, eventually getting to the last page and displaying the

2019-10-05 14:04:26 [root] INFO: Scraped all the pages Successfuly....

But scrapy doesn't shutdown, you'll need some more tweaking for that.



来源:https://stackoverflow.com/questions/58245976/getting-crawled-324-pages-at-133-pages-min-scraped-304-items-at-130-items-m

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!