Question
I'm a paid member of WSJ and I tried to scrape articles for my NLP project. I thought I was keeping the session:
import requests

rs = requests.session()
login_url = "https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin"
payload = {
    "username": "xxx@email",
    "password": "myPassword",
}
result = rs.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
Then I requested the article I want to parse:
r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')
Then I found that the HTML returned is still the non-member version.
I also tried another method, using curl to save the cookies after logging in:
curl -c cookies.txt -I "https://www.wsj.com"
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html
The result is the same.
I'm not very familiar with how authentication works behind the browser. Can someone explain why both of the methods above fail, and how I should fix them to reach my goal? Thank you very much.
Answer 1:
Your attempts failed because the protocol used is OAuth 2.0, not basic authentication.
What's happening here is:
- Some information is generated server side when the login URL https://accounts.wsj.com/login is called: connection & client_id.
- When submitting username/password, the URL https://sso.accounts.dowjones.com/usernamepassword/login is called, which needs some parameters: the previous connection & client_id, plus some static OAuth2 parameters (scope, response_type, redirect_uri).
- The response to that login call contains a form which auto-submits. The form has 3 params: wa, wresult and wctx (wresult is a JWT). It performs a call to https://sso.accounts.dowjones.com/login/callback, which returns a URL with a code param like code=AjKK8g0pZZfvYpju.
- That URL, https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju, is called, which retrieves the cookies for a valid user session.
A bash script which does this with curl, grep, pup and jq:
username="user@gmail.com"
password="YourPassword"
login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")
#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')
rm -f cookies.txt
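# submit the credentials; the response is the auto-submitting form, and pup + jq pull out the wa, wresult and wctx input values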
IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
--data-urlencode "username=$username" \
--data-urlencode "password=$password" \
--data-urlencode "connection=$connection" \
--data-urlencode "client_id=$client_id" \
--data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')
# decode the HTML-encoded double quote (&#34;) in wctx
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')
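# post wa/wresult/wctx to the callback and capture the redirect URL carrying the code parameter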
code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
--data-urlencode "wa=$wa" \
--data-urlencode "wresult=$wresult" \
--data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")
curl -s -c cookies.txt "$code_url"
# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
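For completeness, since the question uses Python, here is a rough requests/BeautifulSoup translation of the same flow. It is an untested sketch: the endpoints and parameter names are taken from the bash script above, and the form parsing assumes the wa/wresult/wctx values arrive as input elements (BeautifulSoup decodes the &#34; entities for you, so no sed-style cleanup should be needed).

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

username = "user@gmail.com"
password = "YourPassword"

s = requests.Session()

# step 1: the login page redirects to an SSO URL carrying connection & client_id
r = s.get("https://accounts.wsj.com/login", allow_redirects=False)
query = parse_qs(urlparse(r.headers["Location"]).query)
connection = query["connection"][0]
client_id = query["client_id"][0]

# step 2: submit the credentials; the response is an auto-submitting form
r = s.post(
    "https://sso.accounts.dowjones.com/usernamepassword/login",
    data={
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": client_id,
        "scope": "openid idp_id",
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
    },
)

# step 3: pull wa, wresult and wctx out of the hidden form and post them to the callback
soup = BeautifulSoup(r.text, "html.parser")
fields = {i["name"]: i.get("value", "") for i in soup.find_all("input") if i.get("name")}
r = s.post(
    "https://sso.accounts.dowjones.com/login/callback",
    data={k: fields[k] for k in ("wa", "wresult", "wctx")},
    allow_redirects=False,
)

# step 4: follow the Location header (the ?code=... URL) so the session picks up the member cookies
s.get(r.headers["Location"])

# the session cookies should now return the member version of the article
article = s.get("https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y")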
Source: https://stackoverflow.com/questions/44965524/scrap-articles-form-wsj-by-requests-curl-and-beautifulsoup