I'm using Beautiful Soup to scrape a web page. The code used to work, but it has stopped working. I think the source site changed its login page, so I updated loginurl to the new address. Now the script apparently cannot connect to that URL, even though I can open it directly in a browser. Can someone run this and tell me what I'm doing wrong?
    import requests
    from bs4 import BeautifulSoup
    import re
    import pymysql
    import datetime

    myurl = 'http://www.cbssports.com'
    loginurl = 'https://auth.cbssports.com/login/index'

    # Check that the login page is reachable at all
    try:
        response = requests.get(loginurl)
    except requests.exceptions.ConnectionError as e:
        print("BAD DOMAIN")

    # Form fields sent with the login POST
    payload = {
        'dummy::login_form': 1,
        'form::login_form': 'login_form',
        'xurl': myurl,
        'master_product': 150,
        'vendor': 'cbssports',
        'userid': 'myuserid',
        'password': 'mypassword',
        '_submit': 'Sign in'
    }

    session = requests.session()
    p = session.post(loginurl, data=payload)

    #(code to scrape the web page)
I get the following error:

    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='auth.cbssports.com', port=443): Max retries exceeded with url: /login (Caused by : [Errno 10054] An existing connection was forcibly closed by the remote host)
Is the website actively blocking my automated login? Or do I have something wrong in the data payload?
Edit: Here's a simpler piece of code...
    import requests

    myurl = 'http://www.cbssports.com'
    loginurl = 'https://auth.cbssports.com/login/index'

    # The main site: this request succeeds
    try:
        response = requests.get(myurl)
    except requests.exceptions.ConnectionError as e:
        print("My URL is BAD")

    # The login page: this request raises ConnectionError
    try:
        response = requests.get(loginurl)
    except requests.exceptions.ConnectionError as e:
        print("Login URL is BAD")
When I run this, only "Login URL is BAD" is printed, so the request to the login URL fails while the request to the main URL succeeds. I can open both URLs manually in a browser. So why is the login page not accessible via Python?
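
To narrow down whether the site is rejecting automated clients, one test I could run is to request the same URL twice, once with the default requests headers and once with a browser-like User-Agent, and compare the outcomes. This is only a rough sketch: the User-Agent string and the names check_url, compare and BROWSER_HEADERS are made up for illustration, and I have not confirmed that the header has anything to do with the failure.

```python
import requests

# Hypothetical browser-like header; the exact string is an assumption.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def check_url(url, headers=None, timeout=10):
    """Try a single GET and return (ok, detail) instead of raising."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
    except requests.exceptions.RequestException as e:
        # Covers ConnectionError, timeouts, SSL errors and invalid URLs
        return False, "%s: %s" % (type(e).__name__, e)
    return True, "HTTP %d" % response.status_code


def compare(url):
    """Request url with default headers and with browser-like headers."""
    return {
        "default": check_url(url),
        "browser-like": check_url(url, headers=BROWSER_HEADERS),
    }
```

Calling compare('https://auth.cbssports.com/login/index') and printing the result would show whether the two requests behave differently. If both fail the same way, the User-Agent is probably not the cause and something else (for example the TLS handshake) would be worth looking at.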