Using Tor + Privoxy to scrape Google Shopping results: how to avoid being blocked?

Submitted by 人走茶凉 on 2019-12-06 05:37:14

Question


I have installed Tor + Privoxy on my server and both work fine (tested). But when I use urllib2 (Python) to scrape Google Shopping results through the proxy, Google always blocks me (sometimes with a 503 error, sometimes with a 403). Does anyone have a solution that would help me avoid this? It would be much appreciated!

The source code that I am using:

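    import urllib2

    # Browser-like request headers.
    _HEADERS = {
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'deflate',
        'Connection': 'close',
        'DNT': '1'
    }

    request = urllib2.Request("https://www.google.com/#q=iphone+5&tbm=shop", headers=_HEADERS)

    # Route requests through Privoxy on 127.0.0.1:8118.
    # The URL is https, so that scheme has to be mapped too; a ProxyHandler
    # given only an "http" entry leaves https requests unproxied.
    proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118",
                                          "https": "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)

    try:
        response = urllib2.urlopen(request)
        html = response.read()
        print html
    except urllib2.HTTPError as e:
        # This is where the 403 / 503 responses show up.
        print e.code
        print e.reason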


Note: when I don't use the proxy, the same request works fine!


Answer 1:


Have you installed stem, the controller library for Tor? In just a few lines of code you can request a new identity from Tor. See:

https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor

Simply use exceptions to catch your 403 and 503 errors and handle them by requesting a new identity, as shown in the link above. Best of luck.
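
A rough sketch of that idea, not a drop-in solution. It is written for Python 3, where urllib2 became urllib.request. The function names (`renew_tor_identity`, `fetch_via_privoxy`, `fetch_with_retry`), the control port 9051, the attempt limit and the wait time are all assumptions for illustration; it also assumes Tor's control port is enabled and accepts the authentication you have configured.

```python
import time
import urllib.error
import urllib.request

PRIVOXY = "127.0.0.1:8118"  # Privoxy address taken from the question


def renew_tor_identity(control_port=9051, password=None):
    """Send NEWNYM to Tor so new connections use a fresh circuit.

    control_port=9051 is an assumption; use whatever ControlPort your torrc sets.
    """
    # Imported lazily so the retry loop can be exercised without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)


def fetch_via_privoxy(url, headers=None):
    """Fetch a URL through Privoxy, for both http and https schemes."""
    proxy = urllib.request.ProxyHandler({"http": PRIVOXY, "https": PRIVOXY})
    opener = urllib.request.build_opener(proxy)
    request = urllib.request.Request(url, headers=headers or {})
    with opener.open(request) as response:
        return response.read()


def fetch_with_retry(url, fetch=fetch_via_privoxy, renew=renew_tor_identity,
                     max_attempts=5, wait_seconds=10):
    """Retry on 403/503, requesting a new Tor identity between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as e:
            # Anything other than a block, or the final attempt: give up.
            if e.code not in (403, 503) or attempt == max_attempts:
                raise
            renew()
            # Give Tor a moment to build the new circuit before retrying.
            time.sleep(wait_seconds)
```

The `fetch` and `renew` parameters are there only so the loop can be tried out with stand-in functions. A new identity does not guarantee an unblocked exit node, which is why the number of attempts is capped.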




Answer 2:


Google blocks many Tor exit nodes because it receives so many requests from them. So whether you hit this error is a matter of probability: keep changing your Tor exit node until you find one that Google has not blocked.

https://www.torproject.org/docs/faq.html.en#GoogleCAPTCHA
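
To put a number on "a matter of probability", here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each new exit node is blocked independently with the same probability; real exit selection is not that simple, and the blocked fraction is not something I know. The function names are made up for this example.

```python
def expected_attempts(blocked_fraction):
    """Mean number of exit nodes tried until one works (geometric distribution)."""
    return 1.0 / (1.0 - blocked_fraction)


def attempts_for_confidence(blocked_fraction, target=0.95):
    """Smallest n such that at least one of n exits works with probability >= target."""
    attempts = 1
    all_blocked = blocked_fraction  # probability that every exit tried so far was blocked
    while 1.0 - all_blocked < target:
        all_blocked *= blocked_fraction
        attempts += 1
    return attempts


print(expected_attempts(0.5))        # 2.0
print(attempts_for_confidence(0.5))  # 5
```

Under that assumption, if half the exits were blocked you would need two tries on average, and a cap of five retries would succeed about 97% of the time.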



Source: https://stackoverflow.com/questions/19464427/using-tor-privoxy-to-scrape-google-shopping-results-how-to-avoid-block
