Python POST Request Failing, [Errno 10054] An existing connection was forcibly closed by the remote host

Submitted anonymously (unverified) on 2019-12-03 08:28:06

Question:

I'm using Beautiful Soup to try to scrape a web page. The code used to work fine, but now it doesn't. I think the problem is that the source site changed its login page, so I replaced the loginurl, but the script apparently can't connect to that URL even though I can reach it directly in a browser. Can someone try running this and tell me what I'm doing wrong?

import requests
from bs4 import BeautifulSoup
import re
import pymysql
import datetime

myurl = 'http://www.cbssports.com'
loginurl = 'https://auth.cbssports.com/login/index'

try:
    response = requests.get(loginurl)
except requests.exceptions.ConnectionError as e:
    print "BAD DOMAIN"

payload = {
    'dummy::login_form': 1,
    'form::login_form': 'login_form',
    'xurl': myurl,
    'master_product': 150,
    'vendor': 'cbssports',
    'userid': 'myuserid',
    'password': 'mypassword',
    '_submit': 'Sign in'
}

session = requests.session()
p = session.post(loginurl, data=payload)

# (code to scrape the web page)

I get the following error:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='auth.cbssports.com', port=443): Max retries exceeded with url: /login (Caused by : [Errno 10054] An existing connection was forcibly closed by the remote host)

Is the website actively blocking my automated login, or do I have something wrong in the data payload?

Edit: Here's a simpler piece of code...

import requests

myurl = 'http://www.cbssports.com'
loginurl = 'https://auth.cbssports.com/login/index'

try:
    response = requests.get(myurl)
except requests.exceptions.ConnectionError as e:
    print "My URL is BAD"

try:
    response = requests.get(loginurl)
except requests.exceptions.ConnectionError as e:
    print "Login URL is BAD"

Note that the login URL fails but the main one does not, yet I can open both URLs manually in a browser. So why is the login page not reachable from Python?

Answer 1:

Short answer: add a scheme (http://) to myurl (changing www.cbssports.com to http://www.cbssports.com) before using it as the xurl POST value.


Longer answer: your session authentication and request code is fine. I believe the issue is that CBS's app is confused by your value for xurl, the parameter CBS reads to decide where to redirect a user after successful authentication. You're passing in a scheme-less URL, www.cbssports.com, which CBS interprets as a relative path. There is no http://cbssports.com/www.cbssports.com, so it (correctly, but confusingly) returns a 404. Adding a scheme to make this an absolute URL fixes the issue and gives you an authenticated session for all subsequent requests. Huzzah!
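As a minimal sketch (it just restates the fix above, reusing the field names from the question's payload with placeholder credentials), the corrected request would look something like this:

import requests

myurl = 'http://www.cbssports.com'                   # absolute URL, scheme included
loginurl = 'https://auth.cbssports.com/login/index'

payload = {
    'dummy::login_form': 1,
    'form::login_form': 'login_form',
    'xurl': myurl,            # absolute URL so the post-login redirect resolves,
                              # not a bare hostname like 'www.cbssports.com'
    'master_product': 150,
    'vendor': 'cbssports',
    'userid': 'myuserid',     # placeholder credentials
    'password': 'mypassword',
    '_submit': 'Sign in',
}

session = requests.Session()
response = session.post(loginurl, data=payload)
print(response.status_code)   # a 404 here suggests the relative-path redirect described above

# The same session object carries the authentication cookies for any
# follow-up requests, e.g. session.get(myurl).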

However, I could not reproduce the ConnectionError you experienced, which makes me wonder whether it was network congestion rather than an anti-scraping measure on CBS's side.

Hope this is helpful.



Answer 2:

OK, I'm not sure why this worked, but I solved it by simply changing https to http in the login address. Like magic, it worked. It appears that CBS may serve an insecure (plain-HTTP) version of the same page (?).
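For example, re-running the question's simpler connectivity test against the plain-HTTP address (a minimal sketch; whether CBS still serves this unencrypted variant is not guaranteed):

import requests

# Workaround from this answer: try the plain-HTTP variant of the login page,
# which responded even when the HTTPS one reset the connection.
loginurl = 'http://auth.cbssports.com/login/index'   # was https://

try:
    response = requests.get(loginurl)
    print(response.status_code)
except requests.exceptions.ConnectionError:
    print("Login URL is still unreachable")

Keep in mind that posting credentials over plain HTTP is unencrypted, so this is better treated as a diagnostic step than a permanent fix.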


