I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even python, but managed to write a spider that does the job.
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I am fairly sure your code is not using SSE 4.2 at all, since you can realistically only reach those instructions from C/C++.
If you do the math, you quickly become compute bound rather than memory bound.
Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe gives in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting: it caches responses and avoids fetching them a second time, which helps on subsequent crawls and even lets you develop offline. See the docs for more info: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
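For reference, enabling it is a couple of lines in settings.py (a minimal sketch; the expiration and directory values shown are simply the documented defaults):

# settings.py
HTTPCACHE_ENABLED = True          # store every response on disk
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'       # kept under the project's .scrapy directory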
Here's a collection of things to try:
- CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs)
- LOG_ENABLED = False (docs)
- yielding an item in a loop instead of collecting items into the items list and returning them (see the sketch after this list)
- running Scrapy on PyPy, see Running Scrapy on PyPy
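As a rough sketch of the first two points (the concurrency numbers below are placeholders to tune, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 100            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 50  # per-domain cap; keep it polite for the target site
LOG_ENABLED = False                  # logging every request/response adds measurable overhead

And for the third point, the parse callback can yield each item as it is built instead of returning a list at the end (the spider name, URL, and selectors here are hypothetical, only the yield pattern matters):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield items one at a time so Scrapy can pipeline them,
        # instead of holding the whole page's items in a list
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

Hope that helps.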
One workaround to speed up your crawl is to configure your start_urls appropriately.
For example, if your target data lives at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:
start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]
In this way, requests fan out from 250 (to 251, 249, and so on) and from 750 (to 751, 749, and so on) simultaneously, so you get roughly a 4x speedup compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
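A minimal sketch of that idea, assuming the spider walks outward from each seed by incrementing and decrementing the doc number (the class name, bounds, and extraction logic are placeholders; Scrapy's built-in duplicate filter stops the fronts where they meet):

import scrapy

class DocSpider(scrapy.Spider):
    name = 'docs'
    start_urls = [
        'http://apps.webofknowledge.com/doc=250',
        'http://apps.webofknowledge.com/doc=750',
    ]

    def parse(self, response):
        doc = int(response.url.split('=')[1])
        # ... extract the fields you need from the page here ...

        # fan out in both directions from each seed so four fronts
        # advance at once; URLs already requested are dropped by
        # the default duplicate filter
        for nxt in (doc - 1, doc + 1):
            if 1 <= nxt <= 1000:
                yield scrapy.Request(f'http://apps.webofknowledge.com/doc={nxt}',
                                     callback=self.parse)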