Question
I want to scrape a website that requires login, using Python with the BeautifulSoup and requests libraries (no Selenium). This is my code:
import requests
from bs4 import BeautifulSoup

# username and password are defined elsewhere
auth = (username, password)

headers = {
    'authority': 'signon.springer.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'origin': 'https://signon.springer.com',
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://signon.springer.com/login?service=https%3A%2F%2Fpress.nature.com%2Fcallback%3Fclient_name%3DCasClienthttps%3A%2F%2Fpress.nature.com&locale=en&gtm=GTM-WDRMH37&message=This+page+is+only+accessible+for+approved+journalists.+Please+log+into+your+press+site+account.+For+more+information%3A+https%3A%2F%2Fpress.nature.com%2Fapprove-as-a-journalist&_ga=2.25951165.1431685211.1610963078-2026442578.1607341887',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'SESSION=40d2be77-b3df-4eb6-9f3b-dac31ab66ce3',
}

params = (
    ('service', 'https://press.nature.com/callback?client_name=CasClienthttps://press.nature.com'),
    ('locale', 'en'),
    ('gtm', 'GTM-WDRMH37'),
    ('message', 'This page is only accessible for approved journalists. Please log into your press site account. For more information: https://press.nature.com/approve-as-a-journalist'),
    ('_ga', '2.25951165.1431685211.1610963078-2026442578.1607341887'),
)

data = {
    'username': username,
    'password': password,
    'rememberMe': 'true',
    'lt': 'LT-95560-qF7CZnAtuDqWS1sFQgBMqPVifS5mTg-16c07928-2faa-4ce0-58c7-5a1f',
    'execution': 'e1s1',
    '_eventId': 'submit',
    'submit': 'Login'
}

session = requests.session()
response = session.post('https://signon.springer.com/login', headers=headers, params=params, data=data, auth=auth)
print(response)
# time.sleep(5) does not make any difference
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)  # I'm not getting the results that I want
I'm not getting the HTML page with the data I want; what I get back is the login page. This is the HTML response: https://www.codepile.net/pile/EGY0YQMv
I think that the problem is because I want to scrape this page:
https://press.nature.com/press-releases
But when I click on that link (while not logged in) it redirects me to a different website for logging in:
https://signon.springer.com/login
To get all the header, param, and data values I used:
inspect page -> Network -> find the login request -> Copy as cURL -> https://curl.trillworks.com/
I have tried multiple POST and GET methods, with and without the auth param, but the result is the same.
What am I doing wrong?
Answer 1:
Try running the script with your username and password filled in and see what you get. If it still doesn't log you in, you may need to send additional headers with the POST request.
import requests
from bs4 import BeautifulSoup

link = 'https://signon.springer.com/login'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # parse the names and values of all fields in the login form,
    # including hidden one-time fields like 'lt' and 'execution'
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['username'] = username
    payload['password'] = password
    print(payload)  # you should see the required parameters within payload

    s.post(link, data=payload)
    # as we have already logged in, the login cookies are stored within the session;
    # subsequent requests reuse the same session we have been using from the very beginning
    r = s.get('https://press.nature.com/press-releases')
    print(r.status_code)
    print(r.text)
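To see what the dict comprehension above actually collects, here is a minimal, self-contained sketch. The form snippet is made up for illustration (it is not Springer's real markup), but the extraction line is the same:

```python
from bs4 import BeautifulSoup

# Hypothetical login form, standing in for the real signon.springer.com page
html = """
<form action="/login" method="post">
  <input name="username">
  <input name="password" type="password">
  <input type="hidden" name="lt" value="LT-example-token">
  <input type="hidden" name="execution" value="e1s1">
  <input type="hidden" name="_eventId" value="submit">
</form>
"""

soup = BeautifulSoup(html, 'html.parser')
# every <input> that has a name attribute, mapped to its value (or '' if none)
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
print(payload)
# {'username': '', 'password': '', 'lt': 'LT-example-token',
#  'execution': 'e1s1', '_eventId': 'submit'}
```

This is also why hard-coding `lt` and `execution` (as in the question) fails: CAS issues those tokens per page load, so they must be scraped fresh from the login page in the same session that submits them.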
Answer 2:
Have you tried using Selenium alongside bs4 and requests? You can make the browser wait until an element is present:
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # seconds
driver.get("https://press.nature.com/press-releases")  # redirects to the login page
# ...log in here...
driver.get("https://press.nature.com/press-releases")  # the page behind the login
That way you can go to the login URL, log in, and then navigate to the page you want to scrape.
Answer 3:
I think your auth parameter isn't in the correct format to be accepted by requests. You can try importing HTTPBasicAuth:
from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth(username, password)
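As a quick offline check that the auth object is wired in correctly, you can prepare a request without sending it and inspect the Authorization header it injects (the credentials below are placeholders; whether the CAS server actually honours Basic auth is a separate question):

```python
import requests
from requests.auth import HTTPBasicAuth

req = requests.Request(
    'POST', 'https://signon.springer.com/login',
    auth=HTTPBasicAuth('alice', 'secret'),  # placeholder credentials
)
prepared = req.prepare()  # builds the request locally; nothing is sent
# HTTPBasicAuth base64-encodes "user:pass" into a Basic Authorization header
print(prepared.headers['Authorization'])  # Basic YWxpY2U6c2VjcmV0
```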
Source: https://stackoverflow.com/questions/65774164/scrape-website-that-require-login-with-beautifulsoup