How can I scrape tooltip values from a Tableau graph embedded in a webpage?

≯℡__Kan透↙ submitted on 2020-07-09 11:49:28

Question


I am trying to figure out whether it is possible, and if so how, to scrape tooltip values from a Tableau graph embedded in a webpage using Python.

Here is an example of a graph that shows tooltips when the user hovers over the bars:

https://public.tableau.com/views/NumberofCOVID-19patientsadmittedordischarged/DASHPublicpage_patientsdischarges?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fpublic.tableau.com%2F&:embed_code_version=3&:tabs=no&:toolbar=yes&:animate_transition=yes&:display_static_image=no&:display_spinner=no&:display_overlay=yes&:display_count=yes&publish=yes&:loadOrderID=1

I grabbed this URL from the original webpage that I want to scrape:

https://covid19.colorado.gov/hospital-data

Any help is appreciated.


Answer 1:


The graph seems to be generated in JavaScript from the result of an API call that looks like:

POST https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID

The SESSION_ID parameter is found, along with other values, in the tsConfigContainer textarea of the page served at the URL used to build the iframe.

Starting from https://covid19.colorado.gov/hospital-data :

  • find the element with class tableauPlaceholder
  • get its param element whose name attribute is "name"
  • its value gives you the URL: https://public.tableau.com/views/{urlPath}
  • the page at that URL contains a textarea with id tsConfigContainer holding a set of JSON values
  • extract the session id (sessionid) and root path (vizql_root) from it
  • make a POST to https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID with the sheetId as form data
  • extract the JSON from the response (the response body as a whole is not valid JSON)
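The last step can be tried in isolation. The following is only a sketch, assuming the response body is a sequence of JSON objects, each preceded by a number and a semicolon (the shape implied by the regular expression in this answer). The helper name split_prefixed_json and the sample string are hypothetical. It uses json.JSONDecoder.raw_decode from the standard library, which parses one JSON value starting at a given index and reports where that value ends, so it does not depend on a greedy pattern to find the boundary between the objects.

```python
import json
import re

# Matches the numeric prefix and semicolon that precede each JSON object.
PREFIX = re.compile(r"\d+;")


def split_prefixed_json(body):
    """Return the JSON objects found in a body shaped like '<n>;{...}<n>;{...}'."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(body):
        match = PREFIX.match(body, pos)
        if match is None:
            # No further prefix at this position: stop parsing.
            break
        # raw_decode parses one JSON value starting at the given index and
        # returns the value together with the index just past its end.
        # The numeric prefix itself is skipped, not interpreted.
        obj, pos = decoder.raw_decode(body, match.end())
        objects.append(obj)
    return objects


# Hypothetical sample in the assumed shape, not real Tableau output.
sample = '15;{"a": {"b": 1}}23;{"c": [1, 2, {"d": 3}]}'
info, data = split_prefixed_json(sample)
print(info)
print(data)
```

With a real response, r.text would be passed in place of sample, and the two returned objects would play the role of info and data in the code that follows.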

Code:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://covid19.colorado.gov/hospital-data")
soup = BeautifulSoup(r.text, "html.parser")

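# take the second Tableau placeholder on the page (index 1)
tableauContainer = soup.find_all("div", {"class": "tableauPlaceholder"})[1]
# the <param name="name"> element holds the view path used in the Tableau URL
urlPath = tableauContainer.find("param", {"name": "name"})["value"]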

r = requests.get(
    f"https://public.tableau.com/views/{urlPath}",
    params= {
        ":showVizHome":"no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

# the response holds two JSON objects, each preceded by "<number>;"
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

From there you have all the data. You will need to work out how the data is split, as it all seems to be dumped into a single list. Looking at the other fields in the JSON object would probably help with that.
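As a starting point for that exploration, here is a minimal sketch that groups the dumped values by type. It assumes each entry of dataColumns is an object with dataType and dataValues keys. That structure is an assumption, not something confirmed in this answer, so check it against the real output first. The function name values_by_type and the sample values are hypothetical.

```python
def values_by_type(data_columns):
    """Group the values of every column under that column's declared type."""
    grouped = {}
    for column in data_columns:
        # Assumed keys: "dataType" (a type name) and "dataValues" (a list).
        grouped.setdefault(column["dataType"], []).extend(column["dataValues"])
    return grouped


# Hypothetical sample mimicking the assumed structure, not real Tableau data.
sample_columns = [
    {"dataType": "integer", "dataValues": [3, 7, 12]},
    {"dataType": "real", "dataValues": [0.5, 1.25]},
    {"dataType": "cstring", "dataValues": ["Admitted", "Discharged"]},
]

grouped = values_by_type(sample_columns)
for data_type, values in grouped.items():
    print(data_type, len(values), values[:5])
```

If the structure matches, passing the dataColumns list printed by the code above instead of sample_columns would show how many values of each type were returned, which helps when matching them to the tooltip fields.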



Source: https://stackoverflow.com/questions/61962611/how-can-i-scrape-tooltips-value-from-a-tableau-graph-embedded-in-a-webpage
