How to extract data from javascript in a json format?

为君一笑 submitted on 2019-12-19 11:39:32

Question


I am having a hard time extracting data from a page. I need the post's title and its posted date. Here is the URL.

URL: https://cheddar.com/media/safety-concerns-over-teslas-autopilot-from-consumer-reports-as-wall-street-turns-bearish

In the page source (view-source) there is a script that contains the data I need as JSON.

It looks something like this (I cropped the rest of the text to save space):

<script>
      window.__RELAY_STORE__ = {"public_at":"2019-05-22T11:02:43-04:00",
      "updated_at":"2019-05-22T15:25:20-04:00",
      "thumbnail_attribution":null,"body":null,
      "title":"Safety Concerns Over Tesla's Autopilot from Consumer Reports as Wall Street Turns Bearish"
</script>

I only need the "public_at" and "title" values.

This is what I have tried:

# Locate the script
data = response.xpath("//script[contains(., 'window.__RELAY_STORE__')]/text()")

# Extract the script text
datatxt = data.extract_first()

# Find the start and end of the data
start = datatxt.find('client:') - 2
end = datatxt.find('window.__REDUX_STATE__')

json_string = datatxt[start:end]

But when I try to load it into a Python dictionary:

data = json.loads(json_string)

I get an error like this:

Extra data: line 1 column 27284 (char 27283)

Any idea how I can get this data?


Answer 1:


Try getting the data this way:

txt = response.xpath("//script[contains(., 'window.__RELAY_STORE__')]/text()").re_first('window.__RELAY_STORE__ = (.*);')

This strips the name of the JavaScript variable and the trailing semicolon. When I then call json.loads(txt), it parses as valid JSON.
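
The same idea can be sketched outside Scrapy with only the standard library. json.loads raises "Extra data" whenever anything other than whitespace follows a complete JSON value, which is consistent with a trailing semicolon being left in the sliced string. In the sketch below, SCRIPT_TEXT is a hypothetical, shortened stand-in for the real script text, the "client:root" nesting is an assumption, and parse_relay_store and find_key are illustrative helper names, not part of Scrapy or of the page.

```python
import json
import re

# Hypothetical, shortened stand-in for the page's <script> text. The real
# payload is much larger, and the nesting shown here is an assumption.
SCRIPT_TEXT = (
    'window.__RELAY_STORE__ = {"client:root": {"public_at": '
    '"2019-05-22T11:02:43-04:00", "title": "Safety Concerns Over Tesla\'s '
    'Autopilot"}};\n'
    'window.__REDUX_STATE__ = {};'
)


def parse_relay_store(script_text):
    """Return the object assigned to window.__RELAY_STORE__ as a dict."""
    # Same pattern as in the answer, with the dots escaped. "." does not
    # match newlines, so the greedy group ends at the last ";" on that line.
    match = re.search(r"window\.__RELAY_STORE__ = (.*);", script_text)
    if match is None:
        raise ValueError("window.__RELAY_STORE__ assignment not found")
    return json.loads(match.group(1))


def find_key(node, key):
    """Depth-first search for the first non-null value stored under key."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


store = parse_relay_store(SCRIPT_TEXT)
public_at = find_key(store, "public_at")
title = find_key(store, "title")
print(public_at, title)
```

In a spider, the string returned by extract_first() would take the place of SCRIPT_TEXT. The recursive lookup is only a convenience for when the exact path to "public_at" and "title" inside the store is not known in advance; if the path is known, plain dictionary indexing is simpler.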



Source: https://stackoverflow.com/questions/56294152/how-to-extract-data-from-javascript-in-a-json-format
