How can I scrape the data from in between these span tags?

Question

I am attempting to scrape the figures shown on https://www.usdebtclock.org/world-debt-clock.html. However, because the numbers are constantly changing, I do not know how to collect this data. This is an example of what I am attempting to do:

import requests
from bs4 import BeautifulSoup

url = "https://www.usdebtclock.org/world-debt-clock.html"
response = requests.get(url)
site = BeautifulSoup(response.text, "html.parser")
data = site.find_all("span", id="X4a79R9BW")

print(data)

The result is this:

"[ ]", whereas I was expecting

"$19,987,137,284,731"

Is there something I can change in order to extract the number?

Answer 1:

BeautifulSoup cannot do this on its own, because the number you need is written into the page by JavaScript after it loads. BeautifulSoup only parses the HTML it is given; it does not execute JavaScript.
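To make this concrete, here is a minimal sketch of what a static parser sees. The HTML string is a hypothetical stand-in for the page (only the span id and the expected figure come from the question), and it uses the standard library's html.parser, the same parser backend requested in the question's BeautifulSoup call, so it runs without extra installs. The class name SpanText is made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the static HTML the server sends: the span is
# empty, and a script (which no HTML parser executes) would fill it in later.
STATIC_HTML = """
<span id="X4a79R9BW"> </span>
<script type="text/javascript">
  document.getElementById('X4a79R9BW').firstChild.nodeValue = '$19,987,137,284,731';
</script>
"""


class SpanText(HTMLParser):
    """Collect the text found inside <span id=target_id>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        # Script contents also arrive here, but only as inert text
        # outside the span, so they are ignored.
        if self.inside:
            self.chunks.append(data)


parser = SpanText("X4a79R9BW")
parser.feed(STATIC_HTML)
span_text = "".join(parser.chunks).strip()
print(repr(span_text))
```

The span contains only whitespace because the script is never run; the figure exists only as text inside the script, which is why a browser-driving tool is needed to see the rendered value.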

An alternative is to use a tool such as Selenium WebDriver:

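from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get('https://www.usdebtclock.org/world-debt-clock.html')
    # find_element_by_xpath was removed in Selenium 4; pass a By locator instead
    elem2 = driver.find_element(By.XPATH, '//span[@id="X4a79R9BW"]')
    print(elem2.text)
finally:
    # quit() ends the WebDriver session and closes the browser
    driver.quit()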

If you have not used Selenium WebDriver before, you will need to install it first by following the Selenium installation instructions.

In particular, you will need to follow the instructions for downloading the browser driver of your choice (I use geckodriver for Firefox) and make sure the executable is on your path.

(I expect there are other Python-based alternatives, also.)

Answer 2:

Based on the page's code, I think what you want to accomplish may not be possible with BS. Running your code returned [<span id="X4a79R9BW"> </span>], and calling getText() on that returned nothing. When inspecting the page in the browser, I noticed that the numerical value in the span was continuously updating, as it does on the page.

Viewing the page source showed that X4a79R9BW appears in five places: first where aspects of the font are set, then in several places where an equation is processed, and finally in the empty span scraped by your code. From the source, it appears that the counter is an equation running inside a <script type="text/javascript"> tag. Here is what I think is the equation running inside that script tag:

{'leftMargin':0,'color':-16751104,:0 */var X3a34729DW = /*144,:14 */    96.9230013  /*751104,:0 */; var R3a45G7S =   /*7104,:54 */  0.000000306947   /*43,451134,:5 */; var Y12 = /*241,:15457 */   18442.16666 /*19601*2*2*/*21600*2*2; /*79301*2*2*/    var Class = new Date(); var Method = Class.getTime() / 1000 - Y12a4798; var Public = X3a34729DW + Method * R3a45G7S;    var Assign = FormatNumber2(Public); document.getElementById   ('X3a34729DW')  .firstChild.nodeValue = Assign; /*'advance':4289}

This section of the page's source indicates that the text you want is continuously updated via JavaScript. Given that, my understanding is that BS is not the appropriate library for this task. Though I have not used it myself, I have seen Selenium suggested as a library for scraping pages that are dynamically updated via JavaScript. Good luck; perhaps someone else can provide a clearer path forward.
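If the counter really is a linear function of time, as the quoted snippet suggests (Public = X3a34729DW + Method * R3a45G7S, where Method is the number of seconds elapsed since a start time), the same arithmetic could be sketched in Python without a browser. This is only an illustration under several assumptions: the quoted snippet writes to the id X3a34729DW rather than X4a79R9BW, it defines Y12 but reads Y12a4798 (assumed here to be the same start time), and the page's FormatNumber2 function is not shown, so no formatting is attempted. The function name extrapolate is hypothetical.

```python
import time


def extrapolate(base, rate_per_second, start_epoch, now=None):
    """Return base + (seconds elapsed since start_epoch) * rate_per_second."""
    if now is None:
        now = time.time()
    return base + (now - start_epoch) * rate_per_second


# Constants copied from the quoted snippet, with the inline comments removed.
BASE = 96.9230013                     # X3a34729DW
RATE = 0.000000306947                 # R3a45G7S
START = 18442.16666 * 21600 * 2 * 2   # Y12, assumed to be seconds since the epoch

# Fixed "now" so the result is reproducible: 1000 seconds after the start time.
value_after_1000s = extrapolate(BASE, RATE, START, now=START + 1000)
print(value_after_1000s)
```

In practice the constants would have to be extracted from the live page source for the span you care about, and they may change whenever the site is updated, so driving a real browser is likely the more robust route.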

Source: https://stackoverflow.com/questions/62744087/how-can-i-scrape-the-data-from-in-between-these-span-tags

Tags

python

html

web-scraping