Scraping in Python - Preventing IP ban

长情又很酷 2020-12-22 22:18

I am using Python to scrape pages. Until now I didn't have any complicated issues.

The site that I'm trying to scrape uses a lot of security checks an

3 answers
  • 2020-12-22 22:35

    If you switch to the Scrapy web-scraping framework, you can reuse several things that were built to prevent and handle bans:

    • the built-in AutoThrottle extension:

    This is an extension for automatically throttling crawling speed based on load of both the Scrapy server and the website you are crawling.

    • rotating user agents with scrapy-fake-useragent middleware:

    Use a random User-Agent provided by fake-useragent for every request

    • rotating IP addresses:

      • Setting Scrapy proxy middleware to rotate on each request
      • scrapy-proxies
    • you can also run it through a local proxy and Tor:

      • Scrapy: Run Using TOR and Multiple Agents
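As a rough sketch, the first two items might be wired up in a project's settings.py like this. The AutoThrottle setting names are Scrapy's own, and the middleware path is the one the scrapy-fake-useragent package documents. The delay and concurrency values are illustrative assumptions, not recommendations from this answer.

```python
# settings.py (fragment); all numeric values are illustrative assumptions

# Let Scrapy adapt the download delay to the latency it observes.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the site responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote server

# Swap the built-in User-Agent middleware for the random one
# (requires `pip install scrapy-fake-useragent`).
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```

A low target concurrency with a generous maximum delay keeps the crawl slow and polite; how far you can raise it depends on the site.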
  • 2020-12-22 22:51

    I had this problem too. I used urllib with Tor in Python 3.

    1. Download and install Tor Browser.
    2. Test Tor.

    Open a terminal and type:

    curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com
    

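    If you see a result, it worked. (Port 9050 is the default SOCKS port of the standalone tor service. The Tor that ships with Tor Browser listens on 9150 by default, so use that port if the connection is refused.)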

    3. Now test it from Python. Run this code (it needs the PySocks and beautifulsoup4 packages):
    import socket
    from urllib.request import Request, urlopen

    import socks  # provided by the PySocks package
    from bs4 import BeautifulSoup

    # Route every new socket through Tor's local SOCKS5 proxy.
    socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
    socket.socket = socks.socksocket

    # check.torproject.org reports whether the request arrived via Tor.
    req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.title.get_text())
    

    If you see

    Congratulations. This browser is configured to use Tor.

    it worked in Python too, which means your scraping traffic is going through Tor.
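If you would rather not replace `socket.socket` for the whole process, the same check can be sketched with the requests library, which accepts a per-request proxy mapping. This is an alternative to the urllib approach above, not part of it. It assumes `pip install requests[socks]`, and the helper names `tor_proxies` and `fetch_via_tor` are hypothetical.

```python
def tor_proxies(host="localhost", port=9050):
    """Build a requests-style proxy mapping for a local Tor SOCKS5 port."""
    # "socks5h" (with the trailing h) asks the proxy to resolve hostnames,
    # so DNS lookups also go through Tor.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def fetch_via_tor(target_url, timeout=30):
    """Fetch target_url through Tor and return the response body as text."""
    import requests  # third-party; imported lazily so tor_proxies() works without it

    response = requests.get(
        target_url,
        proxies=tor_proxies(),
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.text
```

Calling `fetch_via_tor("http://check.torproject.org")` with Tor running should return the same page as the urllib version.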

  • 2020-12-22 23:00

    You could use proxies.

    You can buy several hundred IPs very cheaply and keep using Selenium as you have been. I also suggest varying the browser you use and other user-agent parameters.

    You could also load only a fixed number of pages through each IP address and stop before it gets banned.

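    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    def load_proxy(proxy_host, proxy_port, user_agent="whatever_useragent"):
        """Return a Firefox driver that sends HTTP and HTTPS traffic through the proxy."""
        # Selenium 4 takes preferences via Options; firefox_profile= is deprecated.
        options = Options()
        options.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
        options.set_preference("network.proxy.http", proxy_host)
        options.set_preference("network.proxy.http_port", int(proxy_port))
        options.set_preference("network.proxy.ssl", proxy_host)
        options.set_preference("network.proxy.ssl_port", int(proxy_port))
        options.set_preference("general.useragent.override", user_agent)
        return webdriver.Firefox(options=options)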
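The "x pages per IP" idea can be sketched as a small scheduling helper. `assign_proxies`, `pages_per_proxy`, and the example addresses below are hypothetical, chosen only for illustration. The right page budget per IP is something you would have to find by experiment.

```python
from itertools import cycle


def assign_proxies(urls, proxies, pages_per_proxy):
    """Pair each URL with a proxy, switching proxy every pages_per_proxy URLs."""
    if pages_per_proxy < 1:
        raise ValueError("pages_per_proxy must be at least 1")
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)  # wraps around when the list is exhausted
    current = next(pool)
    plan = []
    for index, url in enumerate(urls):
        # Move to the next proxy once the current one has used its budget.
        if index > 0 and index % pages_per_proxy == 0:
            current = next(pool)
        plan.append((url, current))
    return plan


# Example with made-up addresses: two pages per proxy, then rotate.
example_plan = assign_proxies(
    ["page1", "page2", "page3", "page4", "page5"],
    [("10.0.0.1", 8080), ("10.0.0.2", 8080)],
    pages_per_proxy=2,
)
```

Each `(url, (host, port))` pair can then be handed to `load_proxy(host, port)`, reusing one driver for consecutive URLs that share a proxy.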
    