Python POST Request Failing, [Errno 10054] An existing connection was forcibly closed by the remote host

Submitted anonymously (unverified) on 2019-12-03 08:28:06

Question:

I'm using Beautiful Soup to try to scrape a web page. The code used to work fine, but now it doesn't. I think the problem is that the source site changed its login page, so I replaced the loginurl, but the script apparently can't connect to that URL even though I can reach it directly in a browser. Can someone try running this and tell me what I'm doing wrong?

import requests
from bs4 import BeautifulSoup
import re
import pymysql
import datetime

myurl = 'http://www.cbssports.com'
loginurl = 'https://auth.cbssports.com/login/index'

try:
    response = requests.get(loginurl)
except requests.exceptions.ConnectionError as e:
    print "BAD DOMAIN"

payload = {
    'dummy::login_form': 1,
    'form::login_form': 'login_form',
    'xurl': myurl,
    'master_product': 150,
    'vendor': 'cbssports',
    'userid': 'myuserid',
    'password': 'mypassword',
    '_submit': 'Sign in'
}

session = requests.session()
p = session.post(loginurl, data=payload)

# (code to scrape the web page)

I get the following error:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='auth.cbssports.com', port=443): Max retries exceeded with url: /login (Caused by : [Errno 10054] An existing connection was forcibly closed by the remote host)

Is the website actively blocking my automated login, or do I have something wrong in the data payload?

Edit: Here's a simpler piece of code...

import requests

myurl = 'http://www.cbssports.com'
loginurl = 'https://auth.cbssports.com/login/index'

try:
    response = requests.get(myurl)
except requests.exceptions.ConnectionError as e:
    print "My URL is BAD"

try:
    response = requests.get(loginurl)
except requests.exceptions.ConnectionError as e:
    print "Login URL is BAD"

Note that the login URL fails but the main one does not, yet I can open both URLs manually in a browser. So why is the login page not reachable from Python?

Answer 1:

Short answer: add a scheme (http://) to myurl (changing www.cbssports.com to http://www.cbssports.com) before using it as the xurl POST value.


Longer answer: your session authentication and request code is fine. I believe the issue is that CBS's app is confused by your value for xurl, the parameter CBS reads to decide where to redirect a user after successful authentication. You're passing in a scheme-less URL, www.cbssports.com, which CBS interprets as a relative path. There is no http://cbssports.com/www.cbssports.com, so it (correctly, but confusingly) returns a 404. Adding a scheme to make this an absolute URL fixes the issue and gives you an authenticated session for all subsequent requests. Huzzah!
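As a minimal sketch (it just restates the fix above, reusing the field names from the question's payload with placeholder credentials), the corrected request would look something like this:

import requests

myurl = 'http://www.cbssports.com'                   # absolute URL, scheme included
loginurl = 'https://auth.cbssports.com/login/index'

payload = {
    'dummy::login_form': 1,
    'form::login_form': 'login_form',
    'xurl': myurl,            # absolute URL so the post-login redirect resolves,
                              # not a bare hostname like 'www.cbssports.com'
    'master_product': 150,
    'vendor': 'cbssports',
    'userid': 'myuserid',     # placeholder credentials
    'password': 'mypassword',
    '_submit': 'Sign in',
}

session = requests.Session()
response = session.post(loginurl, data=payload)
print(response.status_code)   # a 404 here suggests the relative-path redirect described above

# The same session object carries the authentication cookies for any
# follow-up requests, e.g. session.get(myurl).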

However, I could not reproduce the ConnectionError you experienced, which makes me wonder whether it was network congestion rather than an anti-scraping measure on CBS's side.

Hope this is helpful.



Answer 2:

OK, I'm not sure why this worked, but I solved it by simply changing https to http in the login address. Like magic, it worked. It appears that CBS may serve an insecure (plain-HTTP) version of the same page (?).
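For example, re-running the question's simpler connectivity test against the plain-HTTP address (a minimal sketch; whether CBS still serves this unencrypted variant is not guaranteed):

import requests

# Workaround from this answer: try the plain-HTTP variant of the login page,
# which responded even when the HTTPS one reset the connection.
loginurl = 'http://auth.cbssports.com/login/index'   # was https://

try:
    response = requests.get(loginurl)
    print(response.status_code)
except requests.exceptions.ConnectionError:
    print("Login URL is still unreachable")

Keep in mind that posting credentials over plain HTTP is unencrypted, so this is better treated as a diagnostic step than a permanent fix.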


