What to do when the data is missing while scraping with Python?

Question

This site has stock data, and I'm trying to extract some of it: https://quickfs.net/company/AAPL:US

Here AAPL is the stock symbol and can be changed.

The page looks like a big table: the columns are years and the rows are calculated values such as Return on Assets and Gross Margin.

For this I tried to follow a few tutorials:

Introduction to Web Scraping (Python) - Lesson 02 (Scrape Tables)

Intro to Web Scraping with Python and Beautiful Soup

Web Scraping HTML Tables with Python

Web scraping with Python — A to Z Part A — Handling BeautifulSoup and avoiding blocks

I get stuck right at the beginning after importing the packages:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

I use this function to retrieve the data from the web page:

def make_soup(url):
    thepage = uReq(url)
    soupdata = soup(thepage, "html.parser")
    return soupdata

then

page_soup = make_soup("https://quickfs.net/company/AAPL:US")

Now, when I look at what data is inside the soup

page_soup.text

The output is just this and not all the data from the webpage:

'\n\n\n\n\n\n\n\n\n\n\n\nExport Fundamental Data U.S. and International Stocks - QuickFS.net\n\n\n\n\n\n  \r\n  Loading QuickFS...\r\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n'

I think the problem is specific to this web page, but I have no idea how to handle it.

Passing a different URL to make_soup(url) sometimes works.

Please help.

Answer 1:

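That is because the page is fully dynamic: JavaScript does all the work in the browser, and BeautifulSoup4 doesn't run JavaScript. The HTML you download therefore contains only the "Loading QuickFS..." placeholder, not the table.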

You have two choices here:

A) Switch to something like Selenium
B) Check which XHR requests the site sends to its API/server and try to emulate them from Python.
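For option A, a minimal sketch could look like the following. It assumes Selenium 4 with Chrome installed, and it assumes the rendered data ends up in a <table> element, which I have not verified for this site. The helper names company_url and fetch_rendered_html are made up for illustration.

```python
def company_url(symbol):
    """Build the QuickFS company page URL for a symbol such as 'AAPL:US'."""
    return f"https://quickfs.net/company/{symbol}"


def fetch_rendered_html(url, timeout=15):
    """Open the page in a real browser and return the HTML after JS has run."""
    # Imported here so company_url() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Assumption: the data is rendered into a <table>; adjust the locator
        # after inspecting the live page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

The string returned by fetch_rendered_html(company_url("AAPL:US")) can then be passed to BeautifulSoup with "html.parser", just like the response in the question.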

In the case of B, you would see that the site is making this call:

curl 'https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/' \
-XGET \
-H 'Accept: application/json, text/plain, */*' \
-H 'Content-Type: application/json' \
-H 'Origin: https://quickfs.net' \
-H 'Accept-Language: en-us' \
-H 'Host: api.quickfs.net' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15' \
-H 'Referer: https://quickfs.net/company/AAPL:US' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Connection: keep-alive' \
-H 'X-Auth-Token: ' \
-H 'X-Referral-Code: '

What you can do is this instead:

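import requests

# Call the same JSON endpoint that the page itself uses
response = requests.get("https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/")
response.raise_for_status()  # fail early on an HTTP error status
data = response.json()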

Here data is the raw data that the site uses to present the info:

{
    "datasets": {
        "metadata": {
            "_id": {},
            "qfs_symbol": "NAS:AAPL",
            "currency": "USD",
            "fsCat": "normal",
            "name": "Apple Inc.",
            "gs3_version_at_metadata_update": 20191106,
            "exchange": "NASDAQ",
            "industry": "Technology Hardware & Equipment",
            "symbol": "AAPL",
            "country": "US",
            "price": 278.58,
        ...
    }
}
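To go one step further, here is a rough sketch that sends a few of the headers from the curl call above and pulls some fields out of the metadata block. It uses only the standard library (urllib, as in the question). The function names are hypothetical, the header subset is a guess at what the server cares about, and the endpoint may have changed or may now require a value for X-Auth-Token, which is empty in the captured call.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.quickfs.net/stocks"


def build_request(symbol):
    """Prepare a GET request that mimics the browser's XHR call."""
    url = f"{API_ROOT}/{symbol}/ovr/Annual/"
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Origin": "https://quickfs.net",
        "Referer": f"https://quickfs.net/company/{symbol}",
        "User-Agent": "Mozilla/5.0",
        # Accept-Encoding is left out on purpose: urllib does not
        # decompress gzip/br responses automatically.
    }
    return Request(url, headers=headers)


def fetch_overview(symbol):
    """Download and decode the JSON payload (needs network access)."""
    with urlopen(build_request(symbol), timeout=30) as response:
        return json.load(response)


def summarize_metadata(data):
    """Pick a few fields out of data["datasets"]["metadata"]."""
    meta = data["datasets"]["metadata"]
    wanted = ("name", "symbol", "exchange", "currency", "price")
    # .get() returns None for a missing key instead of raising KeyError.
    return {key: meta.get(key) for key in wanted}


if __name__ == "__main__":
    print(summarize_metadata(fetch_overview("AAPL:US")))
```

Keeping the request building and the parsing separate from the download makes it possible to check both without touching the network.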

Source: https://stackoverflow.com/questions/61531369/how-to-act-when-not-receiving-the-data-when-scrapping-with-python

Tags

python

web-scraping

beautifulsoup