I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even python, but managed to write a spider that does the job.
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I am fairly sure your code is not using SSE 4.2 at all, since you can realistically only reach those instructions from C/C++.
If you do the math, you quickly become compute bound rather than memory bound.
Looking at your code, I'd say most of that time is spent in network requests rather than in processing the responses. All of the tips @alecxe gives in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting: it caches responses and avoids fetching them a second time, which helps on subsequent crawls and even lets you develop offline. See the docs for more info: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
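For reference, enabling it is a couple of lines in settings.py (a minimal sketch; the expiration and directory values shown are simply the documented defaults):

# settings.py
HTTPCACHE_ENABLED = True          # store every response on disk
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'       # kept under the project's .scrapy directory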
Here's a collection of things to try:
- CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs)
- LOG_ENABLED = False (docs)
- yielding an item in a loop instead of collecting items into the items list and returning them (see the sketch after this list)
- running Scrapy on PyPy, see Running Scrapy on PyPy
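As a rough sketch of the first two points (the concurrency numbers below are placeholders to tune, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 100            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 50  # per-domain cap; keep it polite for the target site
LOG_ENABLED = False                  # logging every request/response adds measurable overhead

And for the third point, the parse callback can yield each item as it is built instead of returning a list at the end (the spider name, URL, and selectors here are hypothetical, only the yield pattern matters):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield items one at a time so Scrapy can pipeline them,
        # instead of holding the whole page's items in a list
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

Hope that helps.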
One workaround to speed up your crawl is to configure your start_urls appropriately.
For example, if your target data lives at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:
start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]
In this way, requests fan out from 250 (to 251, 249, and so on) and from 750 (to 751, 749, and so on) simultaneously, so you get roughly a 4x speedup compared to start_urls = ["http://apps.webofknowledge.com/doc=1"].
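A minimal sketch of that idea, assuming the spider walks outward from each seed by incrementing and decrementing the doc number (the class name, bounds, and extraction logic are placeholders; Scrapy's built-in duplicate filter stops the fronts where they meet):

import scrapy

class DocSpider(scrapy.Spider):
    name = 'docs'
    start_urls = [
        'http://apps.webofknowledge.com/doc=250',
        'http://apps.webofknowledge.com/doc=750',
    ]

    def parse(self, response):
        doc = int(response.url.split('=')[1])
        # ... extract the fields you need from the page here ...

        # fan out in both directions from each seed so four fronts
        # advance at once; URLs already requested are dropped by
        # the default duplicate filter
        for nxt in (doc - 1, doc + 1):
            if 1 <= nxt <= 1000:
                yield scrapy.Request(f'http://apps.webofknowledge.com/doc={nxt}',
                                     callback=self.parse)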