Question
I want to scrape data from this website, specifically the values shown under "IBVC:IND Caracas Stock Exchange Stock Market Index". I am using Beautiful Soup and requests.
import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.36 '
}
res = requests.get("https://www.bloomberg.com/quote/IBVC:IND", headers=headers)
soup = bs(res.content, 'html.parser')
# print(soup)
items = soup.find("div", {"class": "snapshot__0569338b snapshot"})
open_ = items.find("span", {"class": "priceText__1853e8a5"}).text
print(open_)
# Note: this uses the same selector as open_, so it would match the same element
prev_close = items.find("span", {"class": "priceText__1853e8a5"}).text
I can't find the required values in the HTML. Which library should I use to handle this?
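Before switching libraries, it can help to check whether the target element is present at all in the HTML that the server returns. The following is a minimal sketch using only the standard library. The helper name `span_texts` and both sample HTML strings are made up for illustration and are not Bloomberg's actual markup; only the class name is taken from the question.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <span> whose class list contains `wanted`."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = wanted
        self._depth = 0   # greater than 0 while inside a matching <span>
        self.found = []   # one string per matching <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.wanted in classes:
            self._depth += 1
            if self._depth == 1:
                self.found.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.found[-1] += data


def span_texts(html: str, wanted: str) -> list:
    """Return the text of all <span> elements carrying the class `wanted`."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical markup as a browser might show it after JavaScript has run
RENDERED = '<div><span class="priceText__1853e8a5">50,083.00</span></div>'
# Hypothetical markup as a plain HTTP client might receive it
STATIC = '<div id="root"></div><script>render()</script>'

print(span_texts(RENDERED, "priceText__1853e8a5"))
print(span_texts(STATIC, "priceText__1853e8a5"))
```

If the second kind of result (an empty list) comes back for the real response body, the value is filled in by JavaScript after the page loads, and no HTML parser alone will find it.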
Answer 1:
As indicated in other answers, the content is generated via JavaScript and is therefore not in the plain HTML. For this problem, two different angles of attack have been proposed:
1. Selenium, aka The Big Guns: This lets you automate virtually any task in a browser. It comes at a certain cost in terms of speed, though.
2. API request, aka Thought Through: This is not always feasible. When it is, however, it is much more efficient.
I elaborate on the second one. @ViniciusDAvila already laid out the typical blueprint for such a solution: navigate to the site, inspect the Network tab, and figure out which request is responsible for fetching the data.
Once this is done, the rest is a matter of execution:
Scraper
import requests
from urllib.parse import quote

# Constants
HEADERS = {
    'Host': 'www.bloomberg.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': '*/*',
    'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.bloomberg.com/quote/',
    'DNT': '1',
    'Connection': 'keep-alive',
    'TE': 'Trailers'
}
URL_ROOT = 'https://www.bloomberg.com/markets2/api/datastrip'
URL_PARAMS = 'locale=en&customTickerList=true'
VALID_TYPE = {'currency', 'index'}


# Scraper
def scraper(object_id: str, object_type: str, timeout: int = 5) -> list:
    """
    Get the Bloomberg data for the given object.
    :param object_id: The Bloomberg identifier of the object.
    :param object_type: The type of the object. (Currency or Index)
    :param timeout: Maximal number of seconds to wait for a response.
    :return: The parsed JSON response (a list of dictionaries), or an empty list on failure.
    """
    object_type = object_type.lower()
    if object_type not in VALID_TYPE:
        return list()
    # Build headers and url
    object_append = '%s:%s' % (object_id, 'IND' if object_type == 'index' else 'CUR')
    # Copy the constant so repeated calls do not keep appending to the shared Referer
    headers = dict(HEADERS)
    headers['Referer'] += object_append
    url = '%s/%s?%s' % (URL_ROOT, quote(object_append), URL_PARAMS)
    # Make the request and check response status code
    response = requests.get(url=url, headers=headers, timeout=timeout)
    if 200 <= response.status_code < 300:
        return response.json()
    return list()
Test
# Index
object_id, object_type = 'IBVC', 'index'
data = scraper(object_id=object_id, object_type=object_type)
print('The open price for %s %s is: %d' % (object_type, object_id, data[0]['openPrice']))
# The open price for index IBVC is: 50094
# Exchange rate
object_id, object_type = 'EUR', 'currency'
data = scraper(object_id=object_id, object_type=object_type)
print('The open exchange rate for USD per {} is: {}'.format(object_id, data[0]['openPrice']))
# The open exchange rate for USD per EUR is: 1.0993
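Note that `scraper` returns an empty list when the type is invalid or the request fails, so `data[0]` in the test above would raise an `IndexError` in that case. A small guard can make this explicit. The following is only a sketch: the helper name `first_value` is hypothetical, and the sample payload merely mimics the shape implied by `data[0]['openPrice']` rather than reproducing a real response.

```python
def first_value(data, key, default=None):
    """Return data[0][key], or `default` if the list is empty or the key is missing."""
    if not data:
        return default
    first = data[0]
    if not isinstance(first, dict):
        return default
    return first.get(key, default)


# Hypothetical payload shaped like the result used in the test above
sample = [{'openPrice': 50094.5}]
print(first_value(sample, 'openPrice'))
# An empty result, as returned by scraper() on failure
print(first_value([], 'openPrice', default='n/a'))
```

With this in place, a failed request shows up as a default value instead of an exception.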
Answer 2:
Since that's not a static page, you need to make a request to the Bloomberg API. To find out how, go to the page, open the browser's developer tools (Inspect Element), select "Network", filter by "XHR", and reload the page. Then look for responses of type JSON. I did that and believe this is what you want: link
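If the list of requests in the Network panel is long, one option is to export it as a HAR file and filter it with a script. The following is a minimal sketch that assumes the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.content.mimeType`). The function name `json_request_urls` and the sample entries are made up for illustration.

```python
def json_request_urls(har: dict) -> list:
    """Return the URLs of all HAR entries whose response MIME type mentions JSON."""
    urls = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType") or ""
        if "json" in mime.lower():
            urls.append(entry["request"]["url"])
    return urls


# Hypothetical HAR fragment; a real export would be loaded with json.load(open(path))
SAMPLE_HAR = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/page"},
                "response": {"content": {"mimeType": "text/html"}},
            },
            {
                "request": {"url": "https://example.com/api/data"},
                "response": {"content": {"mimeType": "application/json; charset=utf-8"}},
            },
        ]
    }
}

print(json_request_urls(SAMPLE_HAR))
```

The URLs that remain are the candidates to try with `requests`, as done in Answer 1.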
Answer 3:
The required values are loaded dynamically, so you may try Selenium together with BeautifulSoup. Here is some sample code for your reference:
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Put the driver in the folder this script is run from.
# Selenium 4+ takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(os.path.join(os.getcwd(), 'chromedriver')))
driver.get("https://www.bloomberg.com/quote/IBVC:IND")
time.sleep(3)  # crude wait for the JavaScript-rendered content
real_soup = BeautifulSoup(driver.page_source, 'html.parser')
open_ = real_soup.find("span", {"class": "priceText__1853e8a5"}).text
print(f"Price: {open_}")
time.sleep(3)
driver.quit()
Output:
Price: 50,083.00
You can search for chromedriver and download the version that matches your Chrome version.
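The scraped value is a string such as "50,083.00", so it has to be converted before any arithmetic. The following is a minimal sketch, assuming the page uses a comma as the thousands separator and a dot as the decimal separator, as in the output above. The helper name `parse_price` is hypothetical.

```python
def parse_price(text: str) -> float:
    """Convert a price string like '50,083.00' to a float.

    Assumes ',' is a thousands separator and '.' is the decimal separator.
    """
    cleaned = text.strip().replace(",", "")
    return float(cleaned)


print(parse_price("50,083.00"))  # prints 50083.0
```

A page rendered in a different locale may swap the two separators, in which case this helper would need adjusting.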
Source: https://stackoverflow.com/questions/58064494/scrap-data-from-bloomberg