How to prevent getting blacklisted while scraping Amazon [closed]

Question

I am trying to scrape Amazon with Scrapy, but I get this error:

DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> 
(failed 1 times): 503 Service Unavailable

I think this is because Amazon is very good at detecting bots. How can I prevent this?

I used time.sleep(6) before every request.

I don't want to use their API.

I also tried using Tor and Polipo.

Answer 1:

You have to be very careful with Amazon and follow the Amazon Terms of Use and policies related to web-scraping.

Amazon is quite good at banning the IPs of bots. You would have to tweak DOWNLOAD_DELAY and CONCURRENT_REQUESTS to hit the website less often and be a good web-scraping citizen. You would also need to rotate IP addresses (you may look into Crawlera, for instance) and user agents.
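
As a rough illustration of the settings mentioned above, here is a minimal sketch of what the throttling and user-agent rotation could look like in a Scrapy project. The values, the module path myproject.middlewares, the class name RandomUserAgentMiddleware and the user-agent strings are all assumptions made for this example, not something taken from the question. IP rotation is not shown, since it depends on the proxy service you pick.

```python
import random

# --- settings.py -----------------------------------------------------------
# Illustrative values only; tune them for your own crawl.
DOWNLOAD_DELAY = 6                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait a random 0.5x to 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 1             # one request in flight at a time
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True         # let Scrapy slow down further when responses lag

# Hypothetical project path. 400 runs before Scrapy's built-in
# UserAgentMiddleware (500), which only sets the header if it is missing.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}

# --- middlewares.py --------------------------------------------------------
# Sample user-agent strings (assumed); replace with a list you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for each request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider=None):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

Setting DOWNLOAD_DELAY is generally preferable to calling time.sleep() inside a spider, because time.sleep() blocks Scrapy's event loop, while DOWNLOAD_DELAY is handled by the downloader itself.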

Source: https://stackoverflow.com/questions/37077489/how-to-prevent-getting-blacklisted-while-scraping-amazon

Tags

web-scraping

scrapy

web-crawler

amazon

scrapy-spider
