Scraping in Python - Preventing IP ban

长情又很酷 2020-12-22 22:18

I am using Python to scrape pages. Until now I haven't had any complicated issues.

The site that I'm trying to scrape uses a lot of security checks an

3 answers
  •  野趣味 (original poster)
    2020-12-22 22:51

    I had this problem too. I used urllib with Tor in Python 3.

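    1. Download and install Tor Browser.
    2. Test Tor.

    With Tor running, open a terminal and type the command below. Port 9050 is the default SOCKS port of the standalone tor service; the Tor bundled with Tor Browser usually listens on 9150 instead, so adjust the port if the connection is refused.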

    curl --socks5-hostname localhost:9050 
    

    If you get a response, Tor is working.
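
If the curl test fails, it can help to first confirm that anything is listening on the SOCKS port at all. The following is a minimal sketch using only the standard library; the helper name `socks_port_is_open` is hypothetical, and the default port 9050 is taken from the command above. It only checks that the port accepts TCP connections, not that the listener is actually Tor.

```python
import socket


def socks_port_is_open(host="localhost", port=9050, timeout=3.0):
    """Return True if something accepts TCP connections on host:port.

    This does not prove the listener is Tor; it only rules out the
    common case where no proxy is running on that port.
    """
    try:
        # create_connection raises OSError (refused, timeout, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("SOCKS port reachable:", socks_port_is_open())
```

If this prints `False` for 9050, trying 9150 (the usual Tor Browser port) is a reasonable next step.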

    3. Now test it in Python by running this code:
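    # Requires the third-party packages PySocks (provides `socks`) and beautifulsoup4.
    import socket
    from urllib.request import Request, urlopen

    import socks
    from bs4 import BeautifulSoup

    # Set the SOCKS5 proxy so that new sockets go through Tor.
    # Note: replacing socket.socket affects the whole process.
    socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
    socket.socket = socks.socksocket

    # Fetch Tor's check page and print its <title>.
    req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')
    print(soup('title')[0].get_text())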
    

    If you see

    Congratulations. This browser is configured to use Tor.

    then it works in Python too, which means your scraping requests are going through Tor.
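
In a scraper you may want to make this check automatically before sending real requests, rather than reading the printed title by eye. Below is a rough sketch using only the standard library; the function name `page_reports_tor` is hypothetical, and it assumes the success page's title starts with "Congratulations", as in the message quoted above. The check page's wording could change, so treat this as a heuristic.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_reports_tor(html):
    """Return True if the page title looks like Tor's success message.

    Accepts str or bytes (for example the result of urlopen(req).read()).
    """
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    parser = _TitleParser()
    parser.feed(html)
    # Assumption: the success title begins with "Congratulations".
    return parser.title.strip().startswith("Congratulations")
```

For example, passing the `html` value from the snippet above to `page_reports_tor(html)` lets the script stop early when the proxy is not in effect.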
