BeautifulSoup html missing

Submitted by ぐ巨炮叔叔 on 2019-12-23 05:43:06

Question


I'm trying to get the URL of the link that downloads historical data from Yahoo Finance for an asset over a specific timeframe: January 1, 1999 to the present day.

So for example if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d

I would want to acquire this (from the "Download Data" link above the table of data):

"https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200&interval=1d&events=history&crumb=iX6bJ6LfGxc"

I'm using BeautifulSoup, and the problem is that the tag holding the href does not show up in the HTML. At first I thought BeautifulSoup was just not working properly, because find_all('a') and iterating through children/descendants returned no results. But when I dumped the HTML to a text file, the element (along with everything else inside its parent element) was not there. Can someone please explain what is going on? What I'm currently working with is listed below.

from bs4 import BeautifulSoup
import datetime as dTime
import requests

"""
asset = "Materials"
assetSignal = "XLB"
today = dTime.datetime.now()
startTime = str(int(dTime.datetime(1999, 1, 1, 0, 0, 0).timestamp()))
endTime = str(int(dTime.datetime(today.year, today.month, today.day, 0, 0, 0).timestamp()))
url = "https://finance.yahoo.com/quote/" + assetSignal + "/history?period1=" + startTime + "&period2=" + endTime + "&interval=1d&filter=history&frequency=1d"
"""

url = "https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d"
page = requests.get(url)
data = page.content
#soup = BeautifulSoup(data, "html.parser")
soup = BeautifulSoup(data, "lxml")
#soup = BeautifulSoup(data, "xml")
#soup = BeautifulSoup(data, "html5lib")

#Link not found
for link in soup.find_all("a"):
    print(link.get("href"))

#Span is empty?
span = soup.find(class_="Fl(end) Pos(r) T(-6px)")
print(span)
print(span.string)
print(span.contents)
for child in span.children:
    print(child)

#Other span has children.  Target span doesn't
div = soup.find(class_="C($finDarkGray) Mt(20px) Mb(15px)")
print(div)
for child in div.descendants:
    print(child)

#Is the tag even there?
with open("soup.txt", "w") as file:
    file.write(page.text)

Answer 1:


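This website relies heavily on JavaScript. Much of what you see in your browser is not in the response to the first request you make to the website; it is added afterwards by JavaScript that makes additional requests. requests only fetches that initial HTML and does not run any scripts, so the element is never in the markup that BeautifulSoup parses.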
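To illustrate the point, here is a minimal sketch using only the standard library's html.parser. The markup in RAW_HTML is invented for the example (it is not Yahoo's actual page source), and LinkCollector and collect_links are hypothetical helper names. It shows that a link created by a script is just text inside the script tag as far as a static parser is concerned:

```python
from html.parser import HTMLParser

# Invented page source for illustration only: the "Download Data" link is
# created by a script, so it is not an element in the markup the server sends.
RAW_HTML = """
<html><body>
  <a href="/quote/XLB">Summary</a>
  <span id="download-slot"></span>
  <script>
    var slot = document.getElementById("download-slot");
    slot.innerHTML = '<a href="/download/XLB">Download Data</a>';
  </script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag found in the static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_links(html):
    # Parse the markup exactly as received; no JavaScript is executed.
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


links = collect_links(RAW_HTML)
print(links)
```

Only the link that is present in the raw markup is reported. The one written by the script would exist only after a browser had executed it, which matches the empty span seen in the question.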

Try using their API instead, or use something like Selenium, which drives a real web browser and therefore runs the page's JavaScript.
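As a rough sketch of the API route, the download URL quoted in the question can be assembled directly rather than scraped. The endpoint and parameter names below are copied from that URL; build_download_url is a hypothetical helper. The crumb is assumed to be tied to a browser session, so the value from the question is used only as a placeholder and will not work as-is. Whether the endpoint still accepts such requests is not confirmed here.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Endpoint taken from the "Download Data" URL quoted in the question.
DOWNLOAD_BASE = "https://query1.finance.yahoo.com/v7/finance/download/"


def build_download_url(symbol, start, end, crumb):
    """Assemble a download URL in the same shape as the one in the question.

    `crumb` has to be supplied by the caller; this sketch does not obtain it.
    """
    params = {
        "period1": int(start.timestamp()),  # start of range, Unix seconds
        "period2": int(end.timestamp()),    # end of range, Unix seconds
        "interval": "1d",
        "events": "history",
        "crumb": crumb,
    }
    return DOWNLOAD_BASE + symbol + "?" + urlencode(params)


# These UTC datetimes correspond to the period1/period2 values in the question.
start = datetime(1999, 1, 1, 8, tzinfo=timezone.utc)
end = datetime(2017, 6, 28, 7, tzinfo=timezone.utc)

# Placeholder crumb copied from the question; assumed to be expired.
url = build_download_url("XLB", start, end, crumb="iX6bJ6LfGxc")
print(url)
```

Using timezone-aware datetimes keeps the timestamps independent of the machine's local time zone, unlike the naive datetime calls in the question's commented-out code.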



Source: https://stackoverflow.com/questions/44813741/beautifulsoup-html-missing
