Python Requests (Web Scraping) - Building a cookie to be able to view data in a website


Question


I'm trying to scrape a finance website to build an application that compares the accuracy of financial data from various other websites (Google/Yahoo Finance). This is a personal project I started mainly to learn Python programming and script writing.

The URL I am trying to scrape (specifically the stock's "Key Data" such as Market Cap, Volume, etc.) is here:

https://www.marketwatch.com/investing/stock/sbux

I've figured out (with the help of others) that a cookie must be built and sent with each request in order for the page to display the data (otherwise the HTML response comes back essentially empty).
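
For example, a bare requests call shows the symptom (a minimal sketch; the near-empty body matches the Content-Length: 579 in the Step 1 capture below):

import requests

# Without the cookie, the server returns a 200 but with an almost empty body
# and none of the Key Data
resp = requests.get('https://www.marketwatch.com/investing/stock/sbux')
print(resp.status_code)  # 200
print(len(resp.text))    # only a few hundred bytes of HTML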

I used the Opera/Firefox/Chrome developer tools to inspect the HTTP headers and the requests and responses being exchanged. I've come to the conclusion that three steps/requests are needed to receive all the cookie data and build it piece by piece.

Step/Request 1

Simply visiting the above URL.

GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 579
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache

Step/Request 2

I am not sure where this POST URL came from; using Firefox and viewing network connections, it popped up in the "Stack Trace" tab. I have no idea whether the URL is the same for everyone or randomly generated, what POST data is being sent, or where the values of X-Hash-Result and X-Token-Value come from. However, this request returns a very important value in its response headers: 'Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d'. This piece of the cookie is crucial for the next request, in order to return the full cookie and receive the data on the web page.

POST /149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint HTTP/1.1
Host: www.marketwatch.com:443
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Content-Type: application/json; charset=UTF-8
Origin: https://www.marketwatch.com
Referer: https://www.marketwatch.com/investing/stock/sbux
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44
X-Hash-Result: 701c19ee3f45d07b56b40fb8e313214d
X-Token-Value: 900c4055-ef7a-74a8-e9ec-f78f7edc363b

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 17
Content-Type: application/json; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache
Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d; Path=/; HttpOnly
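
For reference, here is roughly how I would replay this request with Python's requests. This is a sketch of the captured traffic only, not a working solution: the path and header values are copied verbatim from the capture above, the empty JSON body is a placeholder since the real POST data is unknown, and whether these values hold across sessions is exactly the open question.

import requests

session = requests.Session()

# Path and header values copied verbatim from the captured request above;
# whether they are stable across sessions/users is the open question
fingerprint_url = ('https://www.marketwatch.com/'
                   '149e9513-01fa-4fb0-aad4-566afd725d1b/'
                   '2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint')
headers = {
    'Origin': 'https://www.marketwatch.com',
    'Referer': 'https://www.marketwatch.com/investing/stock/sbux',
    'X-Hash-Result': '701c19ee3f45d07b56b40fb8e313214d',      # origin unknown
    'X-Token-Value': '900c4055-ef7a-74a8-e9ec-f78f7edc363b',  # origin unknown
}
# json={} is a placeholder -- the real POST body was not captured
resp = session.post(fingerprint_url, headers=headers, json={})
print(session.cookies.get('ncg_g_id_zeta'))  # set by the Set-Cookie response header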

Step/Request 3

This request is sent to the original URL with the cookie picked up in Step 2. The full cookie is then returned in the response, and it can be reused in Step 1 to avoid going through Steps 2 and 3 again. This request also returns the full page of data.

GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d
Referer: https://www.marketwatch.com/investing/stock/sbux
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44

HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 62944
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:17 GMT
Expires: Sun, 26 Aug 2018 05:12:17 GMT
Pragma: no-cache
Server: Kestrel
Set-Cookie: seenads=0; expires=Sun, 26 Aug 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Set-Cookie: mw_loc=%7B%22country%22%3A%22CA%22%2C%22region%22%3A%22ON%22%2C%22city%22%3A%22MARKHAM%22%2C%22county%22%3A%5B%22%22%5D%2C%22continent%22%3A%22NA%22%7D; expires=Sat, 01 Sep 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Vary: Accept-Encoding
x-frame-options: SAMEORIGIN
x-machine: 8cfa9f20bf3eb
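
In requests terms, Step 3 is just the original GET replayed with that cookie attached. A minimal sketch, assuming the ncg_g_id_zeta value obtained in Step 2 is still valid:

import requests

# Cookie value obtained from Step 2's Set-Cookie header
cookies = {'ncg_g_id_zeta': '701c19ee3f45d07b56b40fb8e313214d'}
headers = {
    'Referer': 'https://www.marketwatch.com/investing/stock/sbux',
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44'),
}
resp = requests.get('https://www.marketwatch.com/investing/stock/sbux',
                    headers=headers, cookies=cookies)
# The response should now carry the full page (~62 KB here) plus the remaining
# Set-Cookie pieces (seenads, mw_loc) that complete the cookie
print(len(resp.text))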

Summary

In summary, Step 2 is the most important for getting the remaining cookie piece, but I can't figure out three things:

1) Where does the POST URL come from? It is not embedded in the original page; is it the same for everyone, or is it randomly generated by the site?

2) What data is being sent in the POST request?

3) Where do X-Hash-Result and X-Token-Value come from? Are they required headers on the request?

This was a good challenge that I've spent a few hours on (I'm also very new to Python and HTTP web requests), so I suspect someone with more experience could solve it much faster.

Thank you all for anyone that can help.


Answer 1:


Hello again FromThe6ix!

I spent some time tonight trying to get the cookie-string assembly to work. MarketWatch has done a fairly decent job protecting their data. To build the entire cookie you would need a WSJ API key (WSJ appears to be the site's financial data supplier) and some hidden variables that are potentially only available server-side and strictly withheld depending on your web driver, or lack thereof.

For example, if you try to hit the following endpoint with requests:

POST https://browser.pipe.aria.microsoft.com/Collector/3.0/?qsp=true&content-type=application/bond-compact-binary&client-id=NO_AUTH&sdk-version=ACT-Web-JS-2.7.1&x-apikey=c34cce5c21da4a91907bc59bce4784fb-42e261e9-5073-49df-a2e1-42415e012bc6-6954

you'll get a 400 unauthorized error.

Remember, there is also a good chance that the site's servers and the various APIs they communicate with are exchanging data in ways our browsers cannot see in the network traffic, for example through middleware of some sort. I believe this could account for the missing X-Hash-Result and X-Token-Value values.

I am not saying it is impossible to build this cookie string, just that it is an inefficient route in terms of development time and effort. I also question how easily this method scales to tickers other than AAPL. Unless there is an explicit requirement not to use a web driver, and/or the script must be highly portable with no configuration allowed beyond pip install, I wouldn't choose this method.

That essentially leaves us with either a Scrapy spider or a Selenium scraper (and, unfortunately, a little extra environment configuration, but these are very important skills to learn if you want to write and deploy web scrapers; generally speaking, requests + bs4 is ideal for easy scrapes and unusual code-portability needs).

I went ahead and wrote a Selenium scraper ETL class using a PhantomJS web driver for you. It accepts a ticker string as a parameter and works on stocks other than AAPL. It was tricky, since marketwatch.com will not redirect traffic from a PhantomJS web driver (I can tell they have spent a lot of resources trying to discourage web scrapers, by the way; much more so than, say, yahoo.com).

Anyway, here is the final Selenium script; it runs on Python 2 and 3:

# Market Watch Test Scraper ETL
# Tested on python 2.7 and 3.5
# IMPORTANT: Ensure PhantomJS Web Driver is configured and installed

import subprocess
import sys
import signal
import time


# Package installer function to handle missing packages
# (invokes pip via "python -m pip", since pip.main() was removed in pip 10+)
def install(package):
    print(package + ' package for Python not found, pip installing now....')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
    print(package + ' package has been successfully installed for Python\nContinuing Process...')

# Ensure beautifulsoup4 is installed
try:
    from bs4 import BeautifulSoup
except ImportError:
    install('beautifulsoup4')
    from bs4 import BeautifulSoup

# Ensure selenium is installed
try:
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
except ImportError:
    install('selenium')
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


# Class to extract and transform raw marketwatch.com financial data
class MarketWatchETL:

    def __init__(self, ticker):
        self.ticker = ticker.upper()
        # Set up desired capabilities to spoof a desktop Chrome user agent, since marketwatch.com rejects any PhantomJS request
        self._dcap = dict(DesiredCapabilities.PHANTOMJS)
        self._dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) "
                                                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                           "Chrome/29.0.1547.57 Safari/537.36")
        self._base_url = 'https://www.marketwatch.com/investing/stock/'
        self._retries = 10

    # Private static method to clean and organize the Key Data extract
    @staticmethod
    def _cleaned_key_data_object(raw_data):
        # Pair each label tag with its value tag and pull out the text
        return {str(label.get_text()): value.get_text()
                for label, value in zip(raw_data['labels'], raw_data['values'])}

    # Private method to scrape data from MarketWatch's web page
    def _scrape_financial_key_data(self):
        raw_data_obj = {}
        try:
            driver = webdriver.PhantomJS(desired_capabilities=self._dcap)
        except Exception:
            print('***SETUP ERROR: The PhantomJS Web Driver is either not configured or incorrectly configured!***')
            sys.exit(1)
        driver.get(self._base_url + self.ticker)
        i = 0
        while i < self._retries:
            try:
                time.sleep(3)
                html = driver.page_source
                soup = BeautifulSoup(html, "html.parser")
                labels = soup.find_all('small', class_="kv__label")
                values = soup.find_all('span', class_="kv__primary")
                if labels and values:
                    raw_data_obj.update({'labels': labels})
                    raw_data_obj.update({'values': values})
                    break
                else:
                    i += 1
            except Exception:
                i += 1
                continue
        if i == self._retries:
            print('Please check your internet connection!\nUnable to connect...')
            sys.exit(1)
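        # SIGTERM the PhantomJS process before quit() so the driver exits cleanly
        # (a common workaround for PhantomJS failing to shut down on quit())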
        driver.service.process.send_signal(signal.SIGTERM)
        driver.quit()
        return raw_data_obj

    # Public Method to return a Stock's Key Data Object
    def get_stock_key_data(self):
        raw_data = self._scrape_financial_key_data()
        return self._cleaned_key_data_object(raw_data)


# Script's Main Process to test MarketWatchETL('TICKER')
if __name__ == '__main__':

    # Run financial key data extracts for Microsoft, Apple, and Wells Fargo
    msft_key_data = MarketWatchETL('MSFT').get_stock_key_data()
    aapl_key_data = MarketWatchETL('AAPL').get_stock_key_data()
    wfc_key_data = MarketWatchETL('WFC').get_stock_key_data()

    # Print result dictionaries
    print(msft_key_data.items())
    print(aapl_key_data.items())
    print(wfc_key_data.items())

Which outputs:

dict_items([('Rev. per Employee', '$841.03K'), ('Short Interest', '44.63M'), ('Yield', '1.53%'), ('Market Cap', '$831.23B'), ('Open', '$109.27'), ('EPS', '$2.11'), ('Shares Outstanding', '7.68B'), ('Ex-Dividend Date', 'Aug 15, 2018'), ('Day Range', '108.51 - 109.64'), ('Average Volume', '25.43M'), ('Dividend', '$0.42'), ('Public Float', '7.56B'), ('P/E Ratio', '51.94'), ('% of Float Shorted', '0.59%'), ('52 Week Range', '72.05 - 111.15'), ('Beta', '1.21')])
dict_items([('Rev. per Employee', '$2.08M'), ('Short Interest', '42.16M'), ('Yield', '1.34%'), ('Market Cap', '$1.04T'), ('Open', '$217.15'), ('EPS', '$11.03'), ('Shares Outstanding', '4.83B'), ('Ex-Dividend Date', 'Aug 10, 2018'), ('Day Range', '216.33 - 218.74'), ('Average Volume', '24.13M'), ('Dividend', '$0.73'), ('Public Float', '4.82B'), ('P/E Ratio', '19.76'), ('% of Float Shorted', '0.87%'), ('52 Week Range', '149.16 - 219.18'), ('Beta', '1.02')])
dict_items([('Rev. per Employee', '$384.4K'), ('Short Interest', '27.44M'), ('Yield', '2.91%'), ('Market Cap', '$282.66B'), ('Open', '$58.87'), ('EPS', '$3.94'), ('Shares Outstanding', '4.82B'), ('Ex-Dividend Date', 'Aug 9, 2018'), ('Day Range', '58.76 - 59.48'), ('Average Volume', '18.45M'), ('Dividend', '$0.43'), ('Public Float', '4.81B'), ('P/E Ratio', '15.00'), ('% of Float Shorted', '0.57%'), ('52 Week Range', '49.27 - 66.31'), ('Beta', '1.13')])

The only extra step you will need to take before running this is to install and configure the PhantomJS web driver on your deployment environments. If you need to automate the deployment of a web scraper like this, you could write a bash/PowerShell installer script to handle pre-configuring PhantomJS for your environment; a quick pre-flight check like the sketch below can also fail fast when the driver is missing.
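
A minimal pre-flight sketch (assuming the executable is named phantomjs, its default install name):

import sys

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2 fallback

# Fail fast if the PhantomJS binary is not on PATH, rather than letting
# webdriver.PhantomJS() raise a murkier error later
if which('phantomjs') is None:
    sys.exit('PhantomJS not found on PATH - install and configure it first.')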

Some resources for Installing and Configuring PhantomJS:

Windows/Mac PhantomJS Installation Executables

Debian Linux PhantomJS Installation Guide

RHEL PhantomJS Installation Guide

If you consider this an incomplete answer, then I apologize in advance. I just doubt the practicality, and even the possibility, of assembling the cookie in the manner I suggested in your prior post.

I think the other practical possibility here is to write a Scrapy crawler, which I can attempt to do for you tomorrow night if you want.

Hope this all helps!



Source: https://stackoverflow.com/questions/52027493/python-requests-web-scraping-building-a-cookie-to-be-able-to-view-data-in-a
