How to prevent getting blacklisted while scraping Amazon [closed]

Posted by 大城市里の小女人 on 2019-12-03 10:17:01

Question


I am trying to scrape Amazon with Scrapy, but I get this error:

DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> 
(failed 1 times): 503 Service Unavailable

I think this is because Amazon is very good at detecting bots. How can I prevent this?

I used time.sleep(6) before every request.

I don't want to use their API.

I also tried using Tor and Polipo.


Answer 1:


You have to be very careful with Amazon and follow the Amazon Terms of Use and its policies on web scraping.

Amazon is quite good at banning the IPs of bots. You would have to tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS so that you hit the website less often and behave as a good web-scraping citizen. You would also need to rotate IP addresses (you may look into Crawlera, for instance) and user agents.
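As a rough sketch of the throttling and user-agent rotation described above, something along these lines could go into a Scrapy project. The delay and concurrency values are illustrative guesses, not numbers known to work against Amazon. The middleware class name, the `myproject.middlewares` path, and the user-agent strings are hypothetical. IP rotation is not shown, since it depends on whichever proxy service you choose.

```python
import random

# --- settings.py (values are illustrative assumptions, tune them yourself) ---
DOWNLOAD_DELAY = 5                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 1              # no parallel requests overall
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # and none per domain either

# Hypothetical module path; 400 is chosen so this runs before Scrapy's
# built-in UserAgentMiddleware, which only fills in a missing header.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}

# --- middlewares.py ---
# Example strings only; replace them with user agents you actually want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.3 Safari/605.1.15",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for each request."""

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

Setting DOWNLOAD_DELAY is generally preferable to calling time.sleep() inside a spider, because Scrapy runs on a single-threaded event loop and a blocking sleep stalls all other work as well.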



Source: https://stackoverflow.com/questions/37077489/how-to-prevent-getting-blacklisted-while-scraping-amazon
