I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even to Python, but I managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).
I have looked at the scrapy webpage, the mailing lists and stackoverflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome!
I have listed my code below, if it's needed.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re


class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()


class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]
    # one search-result page per URL, 23770 pages in total
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            # strip the surrounding whitespace from the sales-type cell
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
Thanks!
Here's a collection of things to try:
- use the latest scrapy version (if not using it already)
- check whether non-standard middlewares are used
- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs); see the sketch after this list
- turn off logging with LOG_ENABLED = False (docs)
- try yield-ing each item in the loop instead of collecting items into the items list and returning them all at the end; see the sketch after this list
- use a local DNS cache (see this thread)
- check whether the site uses a download threshold and limits your download speed (see this thread)
- log CPU and memory usage during the spider run and see if there are any problems there
- try running the same spider under the scrapyd service
- see if grequests + lxml performs better (ask if you need any help implementing this solution)
- try running Scrapy on pypy, see Running Scrapy on PyPy
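For the settings and the yield points above, here is a minimal sketch of what that could look like for this spider; the numbers are arbitrary examples to illustrate the knobs, not tuned recommendations:

# settings.py -- example values only, tune them for the target site
CONCURRENT_REQUESTS = 100              # total requests Scrapy keeps in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 50    # concurrent requests against a single domain
LOG_ENABLED = False                    # drop per-request logging overhead

# In the spider, yield each item as soon as it is built instead of
# collecting everything in a list and returning it at the end:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for site in hxs.select("id('searchresult')/tr"):
        item = Sale()
        item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
        # ... fill in the remaining fields exactly as in the original spider ...
        yield item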
Hope that helps.
Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches the responses and avoids downloading them a second time. That would help on subsequent crawls and even during offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
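As a rough sketch, enabling the cache is just a few settings in settings.py; the directory and expiration values below are illustrative defaults, not requirements:

# settings.py -- turn on the built-in HTTP cache middleware
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'      # cached responses live under the project's .scrapy/httpcache
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached pages never expire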
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I'm pretty sure your CPU is not using SSE 4.2 at all, as you can only access that feature from C/C++.
If you do the math, you are quickly compute bound, not memory bound.
One workaround to speed up your scrapy crawl is to configure your start_urls appropriately.
For example, if our target data is at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:
start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]
In this way, the crawl fans out from 250 (to 251, 249, and so on) and from 750 (to 751, 749, and so on) simultaneously, so you will get roughly 4 times the speed compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
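As an illustration only, and assuming the spider discovers neighbouring doc pages from each seed (which is what makes this trick pay off), the seeds above could be generated like this; NUM_DOCS and NUM_SEEDS are hypothetical names for the example:

# hypothetical sketch: spread NUM_SEEDS starting points evenly across the doc range
NUM_DOCS = 1000    # assumed total number of doc pages
NUM_SEEDS = 2      # number of parallel crawl fronts
step = NUM_DOCS // NUM_SEEDS
start_urls = ["http://apps.webofknowledge.com/doc=%d" % n
              for n in range(step // 2, NUM_DOCS, step)]
# -> doc=250 and doc=750, the two seeds shown above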
Source: https://stackoverflow.com/questions/17029752/speed-up-web-scraper