How to act when not receiving the data when scraping with Python?

走远了吗. Posted on 2021-01-29 09:16:40

Question


This site has stock data, and I'm trying to extract some of it: https://quickfs.net/company/AAPL:US

Here AAPL is the stock ticker and can be changed.

The page looks like a big table: the columns are years and the rows are calculated values such as Return on Assets and Gross Margin.

For this I tried to follow a few tutorials:

Introduction to Web Scraping (Python) - Lesson 02 (Scrape Tables)

Intro to Web Scraping with Python and Beautiful Soup

Web Scraping HTML Tables with Python

Web scraping with Python — A to Z Part A — Handling BeautifulSoup and avoiding blocks

I get stuck right at the beginning after importing the packages:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

Then I wrote this function to retrieve the data from the web page:

def make_soup(url):
    thepage=uReq(url)
    soupdata=soup(thepage, "html.parser")
    return(soupdata)

Then:

soup=make_soup("https://quickfs.net/company/AAPL:US")

Now, when I look at what data is inside the soup:

soup.text

The output is just this and not all the data from the webpage:

'\n\n\n\n\n\n\n\n\n\n\n\nExport Fundamental Data U.S. and International Stocks - QuickFS.net\n\n\n\n\n\n  \r\n  Loading QuickFS...\r\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n'

I think it's a problem with this specific web page, but I have no idea how to handle it.

Passing a different URL to make_soup(url) sometimes does work.

Any help would be appreciated.


Answer 1:


That is because the page is fully dynamic: JavaScript does all the work of loading the data, and BeautifulSoup4 doesn't run JavaScript.
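To see the effect in isolation, here is a minimal sketch using a made-up page. The HTML below is a hypothetical stand-in, not QuickFS's actual markup; it only mimics a placeholder that a script would replace in a real browser.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a browser would run the script and replace the
# placeholder text, but BeautifulSoup only parses the markup as delivered.
html = """
<html>
  <body>
    <div id="app">Loading QuickFS...</div>
    <script>
      document.getElementById("app").textContent = "Return on Assets ...";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
app_text = soup.find(id="app").get_text()
print(app_text)  # the placeholder, because the script never ran
```

The parser sees the script as plain text inside a tag, so the div still holds the placeholder.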

You have two choices here:

  • A) Switch to something like Selenium
  • B) Check which XHR requests the site sends to its API server and try to emulate them from Python.
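For option A, a rough sketch could look like the following. The helper name `wait_for_rendered_html`, its parameters, and the idea of waiting until the "Loading QuickFS..." text disappears from the page source are my assumptions, not something verified against the site; the only Selenium calls used are `webdriver.Chrome()`, `driver.get()`, `driver.page_source` and `driver.quit()`.

```python
import time


def wait_for_rendered_html(driver, url, placeholder="Loading QuickFS...",
                           timeout=20.0, poll_interval=0.5):
    """Open url and return the page source once the placeholder text is gone.

    `driver` is any object with Selenium's WebDriver interface
    (`get()` and `page_source`). The placeholder check is an assumption
    about how the page signals that it has finished loading.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if placeholder not in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} still shows {placeholder!r} after {timeout}s")
        time.sleep(poll_interval)


def demo():
    """Not called automatically: needs Selenium and a local Chrome install."""
    # Imported here so the helper above has no hard dependency on Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        return wait_for_rendered_html(driver, "https://quickfs.net/company/AAPL:US")
    finally:
        driver.quit()
```

The HTML returned by `demo()` can then be handed to BeautifulSoup as usual, since by that point the browser has already run the scripts.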

In the case of B, you would see that the site is making this call:

curl 'https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/' \
-XGET \
-H 'Accept: application/json, text/plain, */*' \
-H 'Content-Type: application/json' \
-H 'Origin: https://quickfs.net' \
-H 'Accept-Language: en-us' \
-H 'Host: api.quickfs.net' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15' \
-H 'Referer: https://quickfs.net/company/AAPL:US' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Connection: keep-alive' \
-H 'X-Auth-Token: ' \
-H 'X-Referral-Code: '

What you can do is this instead:

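import requests

# Call the same JSON endpoint that the page requests in the background.
response = requests.get("https://api.quickfs.net/stocks/AAPL:US/ovr/Annual/")
response.raise_for_status()  # fail loudly on an HTTP error status
data = response.json()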

Where data will be the raw data that the site uses to present the info:

{
    "datasets": {
        "metadata": {
            "_id": {},
            "qfs_symbol": "NAS:AAPL",
            "currency": "USD",
            "fsCat": "normal",
            "name": "Apple Inc.",
            "gs3_version_at_metadata_update": 20191106,
            "exchange": "NASDAQ",
            "industry": "Technology Hardware & Equipment",
            "symbol": "AAPL",
            "country": "US",
            "price": 278.58,
        ...
    }
}
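Since the ticker in the URL can be changed, a small sketch like this may help with building the URL and reading a few fields out of the response. The helper names `overview_url` and `metadata_summary` are hypothetical, and the sample dictionary is just an excerpt of the output shown above so that the example runs without a network call. I have not checked whether the API needs a non-empty X-Auth-Token for other tickers.

```python
from urllib.parse import quote

# Base URL taken from the curl call above.
API_ROOT = "https://api.quickfs.net/stocks"


def overview_url(symbol):
    """Build the annual overview URL for a ticker such as 'AAPL:US'."""
    return f"{API_ROOT}/{quote(symbol, safe=':')}/ovr/Annual/"


def metadata_summary(data):
    """Return a few metadata fields; missing fields come back as None."""
    meta = data.get("datasets", {}).get("metadata", {})
    keys = ("name", "symbol", "exchange", "currency", "price")
    return {key: meta.get(key) for key in keys}


# Excerpt of the response shown above, used here so the example runs offline.
sample = {
    "datasets": {
        "metadata": {
            "name": "Apple Inc.",
            "symbol": "AAPL",
            "exchange": "NASDAQ",
            "currency": "USD",
            "price": 278.58,
        }
    }
}

print(overview_url("AAPL:US"))
print(metadata_summary(sample))
```

Using `.get()` with defaults means a response with an unexpected shape yields `None` values instead of a `KeyError`.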


Source: https://stackoverflow.com/questions/61531369/how-to-act-when-not-receiving-the-data-when-scrapping-with-python
