Get the proxy IP address Scrapy is using to crawl


Question


I use Tor to crawl web pages. I started the tor and polipo services and added:

class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "127.0.0.1:8123"
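
For this middleware to actually run, it also has to be registered in settings.py; a minimal sketch, assuming the class lives in a module at myproject/middlewares.py (the module path is an assumption and must match your project layout):

# settings.py -- minimal sketch; 'myproject.middlewares' is an assumed path
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 100,
}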

Now, how can I make sure that scrapy uses a different IP address for requests?


Answer 1:


You can yield the first request to check your public IP, and compare this to the IP you see when you go to http://checkip.dyndns.org/ without using Tor/VPN. If they are not the same, scrapy is obviously using a different IP.

from scrapy import Request

# both methods belong to your Spider subclass
def start_requests(self):
    yield Request('http://checkip.dyndns.org/', callback=self.check_ip)
    # yield other requests from start_urls here if needed

def check_ip(self, response):
    pub_ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
    print("My public IP is: " + pub_ip)
    # yield other requests here if needed
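
If you want the spider to stop when requests are not going through Tor, the same idea can be extended by comparing the observed IP against your real one. A minimal sketch, assuming REAL_IP holds your real (non-Tor) public IP looked up beforehand:

from scrapy.exceptions import CloseSpider

REAL_IP = '203.0.113.7'   # assumption: your real public IP, checked without Tor

def check_ip(self, response):
    pub_ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
    if pub_ip == REAL_IP:
        # the proxy is not in effect -- abort the crawl
        raise CloseSpider('requests are not going through Tor')
    self.logger.info('Crawling from IP %s', pub_ip)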



Answer 2:


The fastest option is to use the scrapy shell and check that the request meta contains the proxy.

Start it from the project root:

$ scrapy shell http://google.com
>>> request.meta
{'handle_httpstatus_all': True, 'redirect_ttl': 20, 'download_timeout': 180, 'proxy': 'http://127.0.0.1:8123', 'download_latency': 0.4804518222808838, 'download_slot': 'google.com'}
>>> response.meta
{'download_timeout': 180, 'handle_httpstatus_all': True, 'redirect_ttl': 18, 'redirect_times': 2, 'redirect_urls': ['http://google.com', 'http://www.google.com/'], 'depth': 0, 'proxy': 'http://127.0.0.1:8123', 'download_latency': 1.5814828872680664, 'download_slot': 'google.com'}

This way you can check that the middleware is configured correctly and that the request is going through the proxy.
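
The same check can also be done from inside a spider, since the proxy that was used is still available on response.meta; a minimal sketch, where parse stands in for any callback:

def parse(self, response):
    # response.meta carries the proxy this response actually came through
    self.logger.info('Used proxy: %s', response.meta.get('proxy'))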



Source: https://stackoverflow.com/questions/27364630/get-proxy-ip-address-scrapy-using-to-crawl
