Scrape data from flash page using rvest

Submitted by 浪尽此生 on 2019-12-12 04:30:44

Question


I am trying to scrape data from this page:

http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?

If I try to scrape the names of the players using the CSS selector and the usual rvest syntax:

library(rvest)
names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
  html_nodes(".scoring-player-name") %>% html_text()

everything goes well.

Unfortunately, if I try to scrape the statistics below (first serve points won, etc.) using the selector .stat-breakdown span, I am not able to retrieve any data.

I know rvest is generally not recommended for scraping dynamically generated pages; however, I don't understand why some data are scraped and some are not.
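
One possible explanation, sketched below under stated assumptions: read_html() only sees the HTML that the server sends, not what JavaScript later builds in the browser. The HTML string here is a hypothetical, simplified stand-in for the real page (the field name firstServePointsWon is invented for illustration), in which the player name is part of the markup while the statistics exist only as data inside a script tag.

```r
library(rvest)

# Hypothetical, simplified stand-in for the served HTML (not the real page):
# the player name is in the markup, the statistics only inside a <script> tag.
html <- '<html><body>
  <div class="scoring-player-name">Player A</div>
  <div class="stat-breakdown"></div>
  <script id="matchStatsData">[{"firstServePointsWon": 30}]</script>
</body></html>'

page <- read_html(html)

# Present in the static HTML, so the selector matches
player_names <- page %>% html_nodes(".scoring-player-name") %>% html_text()

# These spans would only be created by JavaScript in a browser,
# so nothing matches in the static document
stat_spans <- page %>% html_nodes(".stat-breakdown span")

length(player_names)  # 1
length(stat_spans)    # 0
```

If the real page behaves like this stand-in, the selector is not wrong; the nodes it targets simply do not exist until a browser runs the page's scripts.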


Answer 1:


I don't use rvest. If you follow the code below, you should get the format shown in the picture: basically a string that you could transform into a data frame by splitting on the separators : and ,

This tag also contains more information than is displayed in the UI of the webpage. I can also try RSelenium, but I need to get to my other PC, so I will let you know whether RSelenium works for me.

library(XML)
library(RCurl)
library(stringr)

url <- "http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
# download the raw HTML and parse it
url2 <- getURL(url)
parsed <- htmlParse(url2)
# get the messy data from the script tag
step1 <- xpathSApply(parsed, "//script[@id='matchStatsData']", xmlValue)
# remove some unwanted characters: line breaks, tabs, braces and quotes
step2 <- str_replace_all(step1, "\r\n", "")
step3 <- str_replace_all(step2, "\t", "")
step4 <- str_replace_all(step3, "[[{}]\"]", "")

The output is then a single string that can be split on the : and , separators.
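
As an alternative to stripping characters by hand, the content of the tag could be parsed directly, assuming it is valid JSON (the braces and quotes removed above suggest this, but it is not confirmed here). The sketch below swaps in rvest and jsonlite for XML and RCurl. The HTML string and the field names playerName and firstServePointsWon are hypothetical placeholders, not the real page's data.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the page; the real field names are not known here.
html <- '<html><body>
  <script id="matchStatsData">
  [{"playerName": "Player A", "firstServePointsWon": 30},
   {"playerName": "Player B", "firstServePointsWon": 25}]
  </script>
</body></html>'

# Extract the text inside the script tag, using the same XPath as above
raw_json <- read_html(html) %>%
  html_nodes(xpath = "//script[@id='matchStatsData']") %>%
  html_text()

# fromJSON() simplifies an array of flat objects into a data frame
stats <- fromJSON(trimws(raw_json))
```

With the real page, the string would come from the downloaded HTML instead of the inline sample, and nested fields might need extra handling before they fit into a flat data frame.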



Source: https://stackoverflow.com/questions/37643968/scrape-data-from-flash-page-using-rvest
