How to scrape data off morningstar

Asked 2020-07-22 05:19:13

Question


So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but this particular graph has me confused.

After inspecting the element holding the piece of data I wanted to extract, I saw this: <span id="TSMAIN">: 100.7490637</span>. The problem is, my original idea was to iterate through some sort of list of ids, one per data point (if that makes sense?).

Instead, it seems that all the data points are contained within this same element, and the value depends on where your cursor is on the graph.

My problem is, if I use BeautifulSoup's find function and ask for that specific element with the attribute id="TSMAIN", I get a None return, presumably because nothing shows up there unless the cursor is actually on the graph.

Code:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent so the request isn't rejected
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR050AQ&tab=13"

source = requests.get(url, headers=headers)
soup = BeautifulSoup(source.content, 'lxml')

# Look for the span that shows the value at the cursor position on the chart
data = soup.find("span", attrs={"id": "TSMAIN"})
print(data)

Output

None

How can I extract all the data points of this graph?


Answer 1:


It looks like the data can be pulled from an API. The only thing is that the values it returns are relative to the start date sent in the payload: the value for the start date is set to 0, and the numbers after that are relative to it.
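Before the full script, here is a minimal sketch of a single request (the start date here is just an assumed example value), to show the shape of the raw response that the pandas code below works with:

import requests

url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'
payload = {
    'encyId': 'GBP',
    'idtype': 'Morningstar',
    'frequency': 'daily',
    'startDate': '2020-01-01',          # assumed example start date; returns are relative to it
    'performanceType': '',
    'outputType': 'COMPACTJSON',
    'id': 'F0GBR050AQ]2]0]FOGBR$$ALL',  # Schroder Managed Balanced Instl Acc
    'decPlaces': '8',
    'applyTrackRecordExtension': 'false'}

resp = requests.get(url, params=payload).json()
# resp is a list of [timestamp_in_ms, cumulative_return] pairs;
# the entry for the start date should be 0
print(resp[:3])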

import requests
import pandas as pd
from datetime import datetime
from dateutil import relativedelta

userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1':3,'2':6,'3':12,'4':36,'5':60,'6':120}


# Start date: yesterday, minus the chosen number of months
n = datetime.now()
n = n - relativedelta.relativedelta(days=1)
n = n - relativedelta.relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')


url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'

data = []
# Morningstar ids for each chart series (found by inspecting the page's network requests)
idDict = {
    'Schroder Managed Balanced Instl Acc': 'F0GBR050AQ]2]0]FOGBR$$ALL',
    'GBP Moderately Adventurous Allocation': 'EUCA000916]8]0]CAALL$$ALL',
    'Mixed Investment 40-85% Shares': 'LC00000012]8]0]CAALL$$ALL',
    '': 'F00000ZOR1]7]0]IXALL$$ALL'}


for k, v in idDict.items():
    payload = {
        'encyId': 'GBP',
        'idtype': 'Morningstar',
        'frequency': 'daily',
        'startDate': dateStr,
        'performanceType': '',
        'outputType': 'COMPACTJSON',
        'id': v,
        'decPlaces': '8',
        'applyTrackRecordExtension': 'false'}

    # Each response is a list of [timestamp_ms, cumulative_return] pairs
    temp_data = requests.get(url, params=payload).json()
    df = pd.DataFrame(temp_data)
    df['timestamp'] = pd.to_datetime(df[0], unit='ms')
    df['date'] = df['timestamp'].dt.date
    df = df[['date', 1]]
    df.columns = ['date', k]
    data.append(df)

# Join all the series on the dates they have in common
final_df = pd.concat(
    (iDF.set_index('date') for iDF in data),
    axis=1, join='inner'
).reset_index()


final_df.plot(x="date", y=list(idDict.keys()), kind="line")
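One note on that last line: pandas draws the chart with matplotlib, so if you run this as a plain script (rather than in a notebook) you may need to show the figure explicitly:

import matplotlib.pyplot as plt

final_df.plot(x="date", y=list(idDict.keys()), kind="line")
plt.show()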

Output:

print (final_df.head(5).to_string())
         date  Schroder Managed Balanced Instl Acc  GBP Moderately Adventurous Allocation  Mixed Investment 40-85% Shares          
0  2019-12-22                             0.000000                               0.000000                        0.000000  0.000000
1  2019-12-23                             0.357143                               0.406784                        0.431372  0.694508
2  2019-12-24                             0.714286                               0.616217                        0.632422  0.667586
3  2019-12-25                             0.714286                               0.616217                        0.632422  0.655917
4  2019-12-26                             0.714286                               0.612474                        0.629152  0.664124
....

Getting those ids took a little investigation of the page's network requests. Searching through them, I was able to find the corresponding id values and, with a bit of trial and error, work out which values meant what.

Those are the "alternate" ids used, and they show where the line graphs get their data from (in those 4 requests, look at the Preview pane and you'll see the data in there).

Here's the final output/graph:



Source: https://stackoverflow.com/questions/62530100/how-to-scrape-data-off-morningstar
