Question
I'm a paid member of WSJ and I want to log in to my WSJ account from a Linux terminal so I can write code to scrape some articles for my NLP research. I won't release the data in any form.
My approach is based on a previous answer to "Scrap articles form wsj by requests, CURL and BeautifulSoup". The main issue is that the code which worked back then no longer works: apparently WSJ has adopted a different OAuth 2.0 approach. First, I can no longer obtain connection by requesting login_url, which I suspect is the bottleneck, since it is a mandatory field for the next step.
Another thing I noticed is that a state parameter is now used, and I don't know how to handle this field. After running
curl -s 'https://sso.accounts.dowjones.com/authorize?scope=openid+idp_id+roles+email+given_name+family_name+djid+djUsername+djStatus+trackid+tags+prts&client_id=XXXXXXX&response_type=code&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&state=https://www.wsj.com&username=XXXXXX&password=XXXXXX'
It does return "Found. Redirecting to /login?state=XXXX....", but I'm not sure how to use the state parameter after this step.
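For reference, the state value in that response is just a query parameter of the redirect target, so it can be captured with ordinary URL parsing. A minimal sketch in Python (the redirect string below is an illustrative placeholder):
from urllib.parse import urlsplit, parse_qs

# Redirect target taken from the "Found. Redirecting to ..." body or the Location header (curl -i)
redirect = "/login?state=XXXX"

# Split off the query string and read the state parameter
state = parse_qs(urlsplit(redirect).query)["state"][0]
print(state)  # -> XXXX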
Some references I used are: https://developer.dowjones.com/site/global/develop/authentication/index.gsp#2-exchanging-the-authorization-code-for-authn-tokens-98 and https://oauth.net/2/
username="user@gmail.com"
password="YourPassword"
login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")
#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')
rm -f cookies.txt
IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
--data-urlencode "username=$username" \
--data-urlencode "password=$password" \
--data-urlencode "connection=$connection" \
--data-urlencode "client_id=$client_id" \
--data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')
# replace the HTML-encoded double quote &#34;
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')
code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
--data-urlencode "wa=$wa" \
--data-urlencode "wresult=$wresult" \
--data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")
curl -s -c cookies.txt "$code_url"
# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
Answer 1:
There are a few more parameters needed for the /usernamepassword/login request: it needs state and nonce. Also, it seems the connection field is no longer present in the Location header but is hardcoded in a JS file.
The credential details are embedded in a Base64-encoded JSON object inside a script tag under https://accounts.wsj.com/login
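Before wiring this into a full script, you can check whether the login page still embeds that blob with a short standalone Python snippet; the regex and field names mirror the scripts below, and this is only a sketch since the page structure may change again:
import base64
import json
import re

import requests

# Fetch the login page and pull out the argument passed to Base64.decode('...')
login_html = requests.get("https://accounts.wsj.com/login").text
blob = re.search(r"Base64\.decode\('(.*?)'", login_html).group(1)
settings = json.loads(base64.b64decode(blob))

# The fields the /usernamepassword/login request needs
print(settings["clientID"])
print(settings["internalOptions"]["state"])
print(settings["internalOptions"]["nonce"])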
You can update the bash script as follows. It uses curl, jq, sed & pup:
#!/bin/bash
username="your_email@gmail.com"
password="your_password"
base_url="https://accounts.wsj.com"
rm -f cookies.txt
login_page=$(curl -s -L -c cookies.txt "$base_url/login")
jspage=$(echo "$login_page" | pup 'script attr{src}' | grep "app-min")
connection=$(curl -s "$base_url$jspage" | sed -rn "s/.*connection:\s*\"(\w+)\".*/\1/p" | head -1)
credentials=$(echo "$login_page" | \
sed -rn "s/.*Base64\.decode\('(.*)'.*/\1/p" | \
base64 -d | \
jq -r '.internalOptions.state, .internalOptions.nonce, .clientID')
read state nonce clientID < <(echo $credentials)
echo "state: $state"
echo "nonce: $nonce"
echo "client_id: $clientID"
echo "connection: $connection"
login_result=$(curl -s -b cookies.txt -c cookies.txt 'https://sso.accounts.dowjones.com/usernamepassword/login' \
--data-urlencode "username=$username" \
--data-urlencode "password=$password" \
--data-urlencode "connection=$connection" \
--data-urlencode "client_id=$clientID" \
--data-urlencode "state=$state" \
--data-urlencode "nonce=$nonce" \
--data-urlencode "scope=openid idp_id roles email given_name family_name djid djUsername djStatus trackid tags prts" \
--data 'tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | \
pup 'input json{}' | jq -r '.[] | .value')
read wa wresult wctx < <(echo $login_result)
wctx=$(echo "$wctx" | sed 's/&#34;/"/g') # replace the HTML-encoded double quote &#34;
echo "wa: $wa"
echo "wresult: $wresult"
echo "wctx: $wctx"
callback=$(curl -s -b cookies.txt -c cookies.txt -L 'https://sso.accounts.dowjones.com/login/callback' \
--data-urlencode "wa=$wa" \
--data-urlencode "wresult=$wresult" \
--data-urlencode "wctx=$wctx")
# try this one to get an article; your username should be embedded in the page as the logged-in user
#curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
But this bash script is painful to maintain; I'd recommend using a Python script like this:
import requests
from bs4 import BeautifulSoup
import re
import base64
import json
username="your_email@gmail.com"
password="your_password"
base_url="https://accounts.wsj.com"
session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")
jscript = [
    t.get("src")
    for t in soup.find_all("script")
    if t.get("src") is not None and "app-min" in t.get("src")
][0]
credentials_search = re.search(r"Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)
print("client_id : {}".format(credentials["clientID"]))
print("state : {}".format(credentials["internalOptions"]["state"]))
print("nonce : {}".format(credentials["internalOptions"]["nonce"]))
print("scope : {}".format(credentials["internalOptions"]["scope"]))
r = session.get("{}{}".format(base_url, jscript))
connection_search = re.search(r'connection:\s*\"(\w+)\"', r.text, re.IGNORECASE)
connection = connection_search.group(1)
r = session.post(
    'https://sso.accounts.dowjones.com/usernamepassword/login',
    data = {
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": credentials["clientID"],
        "state": credentials["internalOptions"]["state"],
        "nonce": credentials["internalOptions"]["nonce"],
        "scope": credentials["internalOptions"]["scope"],
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login"
    })
soup = BeautifulSoup(r.text, "html.parser")
login_result = dict([
    (t.get("name"), t.get("value"))
    for t in soup.find_all('input')
    if t.get("name") is not None
])
r = session.post(
    'https://sso.accounts.dowjones.com/login/callback',
    data = login_result)
#check connected user
r = session.get("https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y")
username_search = re.search(r'\"firstName\":\s*\"(\w+)\",', r.text, re.IGNORECASE)
print("connected user : " + username_search.group(1))
Source: https://stackoverflow.com/questions/58575756/how-to-log-on-to-my-wsj-account-from-linux-terminal-using-curl-oauth2-0