I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even to Python, but I managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).
I have looked at the scrapy webpage, the mailing lists and stackoverflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome!
I have listed my code below, if it's needed.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re


class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()


class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]
    # one search-result page per URL, 23770 pages in total
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' % n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            # strip the surrounding whitespace from the sales-type cell
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
Thanks!
Here's a collection of things to try:
- use the latest scrapy version (if not using it already)
- check whether non-standard middlewares are used
- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs); see the sketch after this list
- turn off logging with LOG_ENABLED = False (docs)
- try yield-ing each item in the loop instead of collecting items into the items list and returning them all at the end; see the sketch after this list
- use a local DNS cache (see this thread)
- check whether the site uses a download threshold and limits your download speed (see this thread)
- log CPU and memory usage during the spider run and see if there are any problems there
- try running the same spider under the scrapyd service
- see if grequests + lxml performs better (ask if you need any help implementing this solution)
- try running Scrapy on pypy, see Running Scrapy on PyPy
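For the settings and the yield points above, here is a minimal sketch of what that could look like for this spider; the numbers are arbitrary examples to illustrate the knobs, not tuned recommendations:

# settings.py -- example values only, tune them for the target site
CONCURRENT_REQUESTS = 100              # total requests Scrapy keeps in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 50    # concurrent requests against a single domain
LOG_ENABLED = False                    # drop per-request logging overhead

# In the spider, yield each item as soon as it is built instead of
# collecting everything in a list and returning it at the end:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for site in hxs.select("id('searchresult')/tr"):
        item = Sale()
        item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
        # ... fill in the remaining fields exactly as in the original spider ...
        yield item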
Hope that helps.
Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches the responses and avoids downloading them a second time. That would help on subsequent crawls and even during offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
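As a rough sketch, enabling the cache is just a few settings in settings.py; the directory and expiration values below are illustrative defaults, not requirements:

# settings.py -- turn on the built-in HTTP cache middleware
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'      # cached responses live under the project's .scrapy/httpcache
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached pages never expire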
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I'm pretty sure your CPU is not using SSE 4.2 at all, as you can only access that feature from C/C++.
If you do the math, you are quickly compute bound, not memory bound.
One workaround to speed up your scrapy crawl is to configure your start_urls appropriately.
For example, if our target data is at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:
start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]
In this way, the crawl fans out from 250 (to 251, 249, and so on) and from 750 (to 751, 749, and so on) simultaneously, so you will get roughly 4 times the speed compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
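As an illustration only, and assuming the spider discovers neighbouring doc pages from each seed (which is what makes this trick pay off), the seeds above could be generated like this; NUM_DOCS and NUM_SEEDS are hypothetical names for the example:

# hypothetical sketch: spread NUM_SEEDS starting points evenly across the doc range
NUM_DOCS = 1000    # assumed total number of doc pages
NUM_SEEDS = 2      # number of parallel crawl fronts
step = NUM_DOCS // NUM_SEEDS
start_urls = ["http://apps.webofknowledge.com/doc=%d" % n
              for n in range(step // 2, NUM_DOCS, step)]
# -> doc=250 and doc=750, the two seeds shown above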
Source: https://stackoverflow.com/questions/17029752/speed-up-web-scraper