I am using Python
to scrape pages. Until now I didn't have any complicated issues.
The site that I'm trying to scrape uses a lot of security checks, and I keep getting blocked.
If you switch to the Scrapy web-scraping framework, you can reuse a number of things that were built to prevent and handle banning:
AutoThrottle: an extension that automatically throttles crawling speed based on the load of both the Scrapy crawler and the website you are crawling.
Use a random User-Agent, provided by fake-useragent, on every request.
Rotate IP addresses:
you can also route requests through a local proxy and Tor.
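A minimal sketch of the relevant `settings.py` entries for the first two points. The AutoThrottle settings are built into Scrapy; the middleware dotted path for the User-Agent rotation is an assumption based on the scrapy-fake-useragent package, so check its docs for the exact name:

```python
# settings.py -- a sketch, not a complete configuration.

# Built-in AutoThrottle: adapts crawl speed to crawler and site load.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # ceiling for the delay under heavy load
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per site

# Rotate User-Agent headers via fake-useragent
# (assumes `pip install scrapy-fake-useragent`; the dotted path below
# is taken from that package and may differ between versions).
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```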
I had this problem too. I used urllib with Tor in Python 3.
Open a terminal and type:
curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com
If you see the page content, it worked.
import socks
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
# Route all sockets through Tor's SOCKS5 proxy on localhost:9050
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket  # monkey-patch so urllib uses the proxy
req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0', })
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())
If you see
Congratulations. This browser is configured to use Tor.
then it worked in Python too, which means your scraping traffic is going through Tor.
You could use proxies.
You can buy several hundred IPs very cheaply and keep using Selenium as you have been. I also suggest varying the browser you use and other User-Agent parameters.
You could rotate: use a single IP address to load only x pages, then stop and switch to the next one before getting banned.
from selenium import webdriver

def load_proxy(PROXY_HOST, PROXY_PORT):
    # Build a Firefox profile that sends HTTP traffic through the given proxy
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", PROXY_HOST)
    fp.set_preference("network.proxy.http_port", int(PROXY_PORT))
    # Override the User-Agent string (replace with whatever User-Agent you want)
    fp.set_preference("general.useragent.override", "whatever_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
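The "load only x pages per IP" rotation can be sketched in plain Python, independent of Selenium. The proxy strings below are hypothetical placeholders; the generator just decides which proxy serves each page:

```python
from itertools import cycle

def proxy_for_page(proxies, pages_per_proxy):
    """Yield one proxy per page request, switching to the next proxy
    after pages_per_proxy pages so no single IP loads too many."""
    pool = cycle(proxies)
    current = next(pool)
    served = 0
    while True:
        if served == pages_per_proxy:
            current = next(pool)
            served = 0
        served += 1
        yield current

# Hypothetical proxies, rotating every 3 pages:
rotation = proxy_for_page(["1.2.3.4:8080", "5.6.7.8:8080"], 3)
# The first three pages use the first proxy, the next three the
# second, then it wraps around to the first again.
print([next(rotation) for _ in range(7)])
```

Each time `next(rotation)` returns a new proxy, you could split it into host and port and build a fresh driver with `load_proxy(...)` above.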