Scraping Data from a Tableau Map

廉价感情. Submitted on 2021-02-09 08:46:27

Question


I am trying to pull locations and names of Naloxone distribution centers in Illinois for a research project on the opioid crisis.

The Tableau-generated dashboard is published by the Department of Public Health at https://idph.illinois.gov/OpioidDataDashboard/

I've tried everything I could find. First, I changed the URL to "download" the data through Tableau's interface, but that only let me download a PDF of the map, not the dataset behind it. Second, I adapted a Python script I've seen a few times on Stack Overflow to request the data, but I think it runs into some kind of error. Code below.

url = "https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/NaloxoneDistributionLocations"

r = requests.get(
    url,
    params= {
        ":embed":"y",
        ":showAppBanner":"false",
        ":showShareOptions":"true",
        ":display_count":"no",
        "showVizHome": "no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://tableau.ons.org.br{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

Appreciate any help.


Answer 1:


It's a bit complex because several things combine:

  • the Tableau "configuration page" that contains the tsConfigContainer textarea is not part of the original page; its URL is built dynamically from the param HTML tags
  • it uses a cross-site request forgery (XSRF) token stored in cookies, and to get that cookie you need to call a specific API whose URL is also built dynamically from the param HTML tags
  • from the tsconfig parameters, you can build the data URL, as you found in other Stack Overflow posts

The flow is as follows:

  • call GET https://idph.illinois.gov/OpioidDataDashboard/ and scrape the param tags under the div with class tableauPlaceholder

From those tags, the host is https://interactive.data.illinois.gov
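As a rough illustration of this step, the sketch below collects the param tags from an embed snippet. The HTML is a made-up sample (the real page markup and ticket value will differ), `ParamCollector` and `extract_params` are hypothetical helper names, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages.

```python
from html.parser import HTMLParser

# Hypothetical sample of the embed markup; the real page and ticket will differ.
SAMPLE_HTML = """
<div class="tableauPlaceholder">
  <object class="tableauViz">
    <param name="host_url" value="https://interactive.data.illinois.gov/" />
    <param name="site_root" value="/t/DPH" />
    <param name="name" value="opioidTDWEB_prod/MortalityandMorbidity" />
    <param name="ticket" value="SAMPLE_TICKET" />
  </object>
</div>
<param name="outside" value="ignored" />
"""


class ParamCollector(HTMLParser):
    """Collect name/value pairs of <param> tags inside div.tableauPlaceholder."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # how many <div> levels deep we are inside the placeholder
        self.params = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or "tableauPlaceholder" in classes:
                self.depth += 1
        elif tag == "param" and self.depth and "name" in attrs:
            self.params[attrs["name"]] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_params(html):
    collector = ParamCollector()
    collector.feed(html)
    return collector.params


params = extract_params(SAMPLE_HTML)
print(params)
```

The param names (host_url, site_root, name, ticket) are the ones the full code in this answer reads; the sample values are inferred from the URLs listed in the answer, apart from the placeholder ticket.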

  • from those param tags, build the "session URL", which looks like this:

    GET /trusted/{ticket}/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
    

The URL above is used only to store the cookies (including the XSRF token) in the session.

  • from the same param tags, build the "configuration URL", which looks like this:

    GET /t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
    

Extract the textarea with id tsConfigContainer and parse the JSON it contains.

  • build the "data URL" from the JSON extracted above; the URL looks like this:

    POST /vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/{session_id}
    

The response is JSON with extra text in front of it to prevent JSON hijacking: two JSON objects, each preceded by a number and a semicolon. Use a regex to extract both objects, then parse the (very large) JSON data.
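To make the extraction step concrete, here is a minimal sketch run against a tiny synthetic string instead of a real response. It reuses the regex from the full code in this answer; `split_bootstrap_response` is a hypothetical helper name, and passing re.DOTALL (so that `.` also matches newlines) is an assumption added here, not something the thread confirms is needed.

```python
import json
import re

# Synthetic stand-in for the bootstrapSession response body: two JSON objects,
# each preceded by "<digits>;". Real responses are far larger.
SAMPLE_RESPONSE = '9;{"a": 1}24;{"secondaryInfo": {"x": [1, 2]}}'


def split_bootstrap_response(text):
    """Return the two JSON objects embedded in the prefixed response text."""
    match = re.search(r"\d+;({.*})\d+;({.*})", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like '<n>;{...}<n>;{...}'")
    return json.loads(match.group(1)), json.loads(match.group(2))


info, data = split_bootstrap_response(SAMPLE_RESPONSE)
print(info)
print(data["secondaryInfo"])
```

Because both groups are greedy, the split could land in the wrong place if a string value inside the JSON happened to contain a `}`, digits, `;` and `{` in sequence, so it is worth letting json.loads fail loudly rather than catching its errors.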

All the URLs needed look like this:

GET https://idph.illinois.gov/OpioidDataDashboard/
GET https://interactive.data.illinois.gov/trusted/yIm7jkXyRQuH9Ff1oPvz_w==:790xMcZuwmnvijXHg6ymRTrU/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
GET https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
POST https://interactive.data.illinois.gov/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0
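The last three URLs can be assembled with plain string formatting. The sketch below is only an illustration: `build_view_urls` and `build_data_url` are hypothetical helper names, the param values are inferred from the URLs listed above, the ticket is a placeholder, and the vizql_root and session id are copied from the example POST URL.

```python
# Values inferred from the URLs listed in this answer; the ticket is a placeholder.
params = {
    "host_url": "https://interactive.data.illinois.gov/",
    "site_root": "/t/DPH",
    "name": "opioidTDWEB_prod/MortalityandMorbidity",
    "ticket": "SAMPLE_TICKET",
}


def build_view_urls(params):
    """Build the session (trusted ticket) URL and the configuration URL."""
    host = params["host_url"].rstrip("/")  # avoid a double slash when joining
    view_path = f'{params["site_root"]}/views/{params["name"]}'
    return {
        "session": f'{host}/trusted/{params["ticket"]}{view_path}',
        "config": f"{host}{view_path}",
    }


def build_data_url(host_url, vizql_root, session_id):
    """Build the bootstrapSession URL from values found in the tsconfig JSON."""
    host = host_url.rstrip("/")
    return f"{host}{vizql_root}/bootstrapSession/sessions/{session_id}"


urls = build_view_urls(params)
data_url = build_data_url(
    params["host_url"],
    "/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity",
    "2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0",
)
print(urls["session"])
print(urls["config"])
print(data_url)
```

Using rstrip("/") instead of slicing off the last character means the helpers still work if host_url arrives without a trailing slash.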

The full code:

import json
import re

import requests
from bs4 import BeautifulSoup

# A Session keeps cookies (including the XSRF token) across requests.
s = requests.Session()

# 1. Load the dashboard page and collect the <param> tags of the Tableau embed.
init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
print(f"GET {init_url}")
r = s.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = {
    t["name"]: t["value"]
    for t in soup.find("div", {"class": "tableauPlaceholder"}).find_all("param")
}

# 2. Call the trusted-ticket URL only so the session stores the cookies (XSRF token).
session_url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {session_url}")
r = s.get(session_url)

# 3. Fetch the configuration page; [:-1] drops the trailing "/" of host_url.
config_url = f'{paramTags["host_url"][:-1]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {config_url}")
r = s.get(config_url,
    params = {
        ":embed": "y",
        ":showVizHome": "no",
        ":host_url": "https://interactive.data.illinois.gov/",
        ":embed_code_version": 2,
        ":tabs": "yes",
        ":toolbar": "no",
        ":showShareOptions": "false",
        ":display_spinner": "no",
        ":loadOrderID": 0,
})
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

# 4. Request the data with the session id and vizql root from the configuration JSON.
dataUrl = f'{paramTags["host_url"][:-1]}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
print(f"POST {dataUrl}")
r = s.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

# 5. The body holds two JSON objects, each prefixed with "<digits>;".
#    A raw string keeps \d from being treated as a string escape.
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
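The final print dumps the raw dataColumns structure. As a hedged sketch of one way to look through it, the example below groups values by type. It rests on an assumption this thread does not confirm: that dataColumns is a list of objects with `dataType` and `dataValues` keys. The sample data is synthetic and `values_by_type` is a hypothetical helper name.

```python
# Synthetic stand-in for data[...]["dataSegments"]["0"]["dataColumns"].
# Assumption: each column is a dict with "dataType" and "dataValues" keys.
sample_columns = [
    {"dataType": "real", "dataValues": [41.88, -87.63]},
    {"dataType": "cstring", "dataValues": ["Example Pharmacy", "Example Clinic"]},
]


def values_by_type(columns):
    """Group every column's values under its declared data type."""
    grouped = {}
    for column in columns:
        data_type = column.get("dataType", "unknown")
        grouped.setdefault(data_type, []).extend(column.get("dataValues", []))
    return grouped


grouped = values_by_type(sample_columns)
for data_type, values in grouped.items():
    print(data_type, values)
```

If the real structure differs, printing the keys of each level of the nested JSON is a quick way to see what is actually there before writing any extraction logic.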




Source: https://stackoverflow.com/questions/63025296/scraping-data-from-a-tableau-map
