Question
I am trying to scrape Amazon with Scrapy, but I get this error:
DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031>
(failed 1 times): 503 Service Unavailable
I think this is because Amazon is very good at detecting bots. How can I prevent this?
I call time.sleep(6) before every request.
I don't want to use their API.
I have also tried using Tor and Polipo.
Answer 1:
You have to be very careful with Amazon and follow the Amazon Terms of Use and its policies on web scraping.
Amazon is quite good at banning the IPs of bots. You would have to tweak DOWNLOAD_DELAY and CONCURRENT_REQUESTS to hit the website less often and be a good web-scraping citizen. You would also need to rotate IP addresses (you could look into Crawlera, for instance) and user agents.
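As a rough illustration of the throttling and user-agent rotation mentioned above, here is a minimal sketch. The setting names (DOWNLOAD_DELAY, CONCURRENT_REQUESTS, and so on) are real Scrapy settings, but the values, the user-agent strings, the class name RotateUserAgentMiddleware and the module path myproject.middlewares are all assumptions made for this example, not something from the original question. Letting Scrapy schedule the delay through DOWNLOAD_DELAY is generally preferable to calling time.sleep() in a spider, since a blocking sleep stalls the whole event loop. IP rotation is not covered here, and none of this guarantees you will avoid a 503.

```python
import random

# Hypothetical values for a project's settings.py; tune them for your crawl.
DOWNLOAD_DELAY = 6                  # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the wait around DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 1             # one request in flight overall
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # and at most one per domain
AUTOTHROTTLE_ENABLED = True         # back off when the server slows down

# Example user-agent strings (assumed; replace with a list you maintain).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header; returning None lets Scrapy continue
        # processing the request through the remaining middlewares.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None


# Register the middleware (the module path is hypothetical). Priority 400
# runs it before Scrapy's built-in UserAgentMiddleware (500), which only
# fills in the header when it is still missing.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}
```

In a real project the settings would live in settings.py and the class in middlewares.py; they are shown together here only to keep the sketch in one place.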
Source: https://stackoverflow.com/questions/37077489/how-to-prevent-getting-blacklisted-while-scraping-amazon