BeautifulSoup html missing

Question

I'm trying to get the URL of the link that downloads historical data from Yahoo Finance for an asset over a specific timeframe: January 1, 1999 to the present day.

For example, if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d

I want to get this URL (from the "Download Data" link above the data table):

"https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200&interval=1d&events=history&crumb=iX6bJ6LfGxc"

I'm using BeautifulSoup, and the tag that holds the href does not show up in the HTML. At first I thought BeautifulSoup was not working properly, because find_all('a') and iterating through children/descendants returned nothing useful. But when I dumped the HTML to a text file, the element (along with everything else inside its parent element) was not there at all. Can someone please explain what is going on? My current code is below.

from bs4 import BeautifulSoup
import datetime as dTime
import requests

"""
asset = "Materials"
assetSignal = "XLB"
today = dTime.datetime.now()
startTime = str(int(dTime.datetime(1999, 1, 1, 0, 0, 0).timestamp()))
endTime = str(int(dTime.datetime(today.year, today.month, today.day, 0, 0, 0).timestamp()))
url = "https://finance.yahoo.com/quote/" + assetSignal + "/history?period1=" + startTime + "&period2=" + endTime + "&interval=1d&filter=history&frequency=1d"
"""

url = "https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d"
page = requests.get(url)
data = page.content
#soup = BeautifulSoup(data, "html.parser")
soup = BeautifulSoup(data, "lxml")
#soup = BeautifulSoup(data, "xml")
#soup = BeautifulSoup(data, "html5lib")

# Link not found
for link in soup.find_all("a"):
    print(link.get("href"))

# Span is empty?
span = soup.find(class_="Fl(end) Pos(r) T(-6px)")
print(span)
if span is not None:  # find() returns None when nothing matches
    print(span.string)
    print(span.contents)
    for child in span.children:
        print(child)

# Other span has children. Target span doesn't
div = soup.find(class_="C($finDarkGray) Mt(20px) Mb(15px)")
print(div)
if div is not None:
    for child in div.descendants:
        print(child)

# Is the tag even there?
with open("soup.txt", "w", encoding="utf-8") as file:
    file.write(page.text)

Answer 1:

This website relies heavily on JavaScript. Much of what you see in your browser is not in the response to the initial request; it is added afterwards by JavaScript that makes further requests. requests only fetches that initial HTML and never runs the scripts, so BeautifulSoup never sees the link.
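To make the difference concrete, here is a minimal offline sketch. The two HTML strings are made-up stand-ins (not real Yahoo markup) for what requests receives versus what the browser DOM looks like after the scripts have run, and `HrefCollector` / `find_hrefs` are hypothetical helper names. It uses the standard-library parser only to stay dependency-free; `soup.find_all("a")` would show the same thing.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def find_hrefs(html, needle):
    """Return every href in `html` that contains the substring `needle`."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [href for href in collector.hrefs if needle in href]


# Made-up stand-in for page.text: the span exists, but the script that
# would fill it has not run, so it is empty.
STATIC_HTML = (
    '<span class="Fl(end) Pos(r) T(-6px)"></span>'
    '<script src="app.js"></script>'
)

# Made-up stand-in for the DOM after the browser has run the scripts.
RENDERED_HTML = (
    '<span class="Fl(end) Pos(r) T(-6px)">'
    '<a href="https://query1.finance.yahoo.com/v7/finance/download/XLB'
    '?period1=915177600">Download Data</a>'
    '</span>'
)

print(find_hrefs(STATIC_HTML, "/finance/download/"))  # prints []
print(find_hrefs(RENDERED_HTML, "/finance/download/"))
```

No parser option changes this: if the link is not in the text you were sent, "html.parser", "lxml" and "html5lib" will all fail to find it.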

Try their API instead, or use a tool such as Selenium, which drives a real web browser and therefore runs the page's JavaScript.
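If you go the API route, you can assemble the download URL yourself instead of scraping it. The sketch below only rebuilds the URL quoted in the question from its parts; `build_download_url` is a hypothetical helper name. It assumes the endpoint and parameter names are exactly as they appear in that URL (they may have changed since), and it treats the crumb as an input: the crumb appears to be tied to a browser session, and obtaining one is not covered here.

```python
from urllib.parse import quote, urlencode

# Base of the download URL quoted in the question (assumed still valid).
DOWNLOAD_BASE = "https://query1.finance.yahoo.com/v7/finance/download/"


def build_download_url(symbol, period1, period2, crumb, interval="1d"):
    """Assemble a download URL from a ticker, two Unix timestamps and a crumb."""
    query = urlencode({
        "period1": int(period1),   # start of the range, Unix seconds
        "period2": int(period2),   # end of the range, Unix seconds
        "interval": interval,
        "events": "history",
        "crumb": crumb,            # session-specific token, supplied by caller
    })
    return f"{DOWNLOAD_BASE}{quote(symbol, safe='')}?{query}"


# Values taken from the question; the crumb shown there will not be
# valid for anyone else's session.
url = build_download_url("XLB", 915177600, 1498633200, "iX6bJ6LfGxc")
print(url)
```

Using urlencode rather than string concatenation keeps the parameters properly escaped if a crumb ever contains characters that are not URL-safe.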

Source: https://stackoverflow.com/questions/44813741/beautifulsoup-html-missing

Tags

python

html

beautifulsoup

html-parsing