Question
I am trying to screen scrape tennis results data (point by point data, not just final result) from this page using R.
http://www.scoreboard.com/au/match/wang-j-karlovic-i-2014/M1mWYtEF/#point-by-point;1
Using the regular R screen scraping functions like readLines(), htmlTreeParse() etc. I am able to scrape the source HTML for the page, but that does not contain the results data.
Is it possible to scrape all the text from the page, as if I were on the page in my browser and selected all and then copied?
Answer 1:
That data is loaded via AJAX from http://d.scoreboard.com/au/x/feed/d_mh_M1mWYtEF_en-au_1, so it is not part of the HTML source that R downloads from the match page. However, because both URLs use the code M1mWYtEF, you can go directly to the feed page that has the data you want. Using Chrome's devtools, I was able to see that the page sends a header of X-Fsign: SW9D1eZo, which lets you access that feed (you get a 401 Unauthorized error otherwise).
Here is R code for getting the html that holds the data you want from your example page:
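library(httr)

# Match code shared by the match page URL and the feed URL
page_code <- "M1mWYtEF"
linked_page <- paste0("http://d.scoreboard.com/au/x/feed/d_mh_",
                      page_code, "_en-au_1")

# The feed returns 401 Unauthorized without the X-Fsign header
GET(linked_page, add_headers("X-Fsign" = "SW9D1eZo"))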
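If what you want is the plain text rather than the raw markup, a rough sketch along the following lines might help. The helper names feed_url, html_to_text and fetch_point_text are made up for illustration, and I have not checked the exact structure of the feed, so the regex-based tag stripping is only an assumption about what is good enough here. The URL pattern and the X-Fsign value are the ones from the example above and may differ for other matches or change over time.

```r
# Build the feed URL for a match code (URL pattern taken from the example above).
feed_url <- function(page_code) {
  paste0("http://d.scoreboard.com/au/x/feed/d_mh_", page_code, "_en-au_1")
}

# Hypothetical helper: crude tag stripping with base R regular expressions.
html_to_text <- function(html) {
  text <- gsub("<[^>]+>", " ", html)        # replace each tag with a space
  text <- gsub("[[:space:]]+", " ", text)   # collapse runs of whitespace
  trimws(text)
}

# Hypothetical helper: fetch the feed and return its text content.
# Requires the httr package; the default X-Fsign value is an assumption
# carried over from the example and may not stay valid.
fetch_point_text <- function(page_code, fsign = "SW9D1eZo") {
  resp <- httr::GET(feed_url(page_code), httr::add_headers("X-Fsign" = fsign))
  httr::stop_for_status(resp)  # raises an error on 401 and other HTTP failures
  html_to_text(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Calling fetch_point_text("M1mWYtEF") should then give you one long string, roughly what select-all and copy would produce. If you need the individual points as structured data, a real HTML parser (for example the XML or xml2 package) is likely more robust than stripping tags with regular expressions.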
Source: https://stackoverflow.com/questions/24835984/screen-scraping-actual-page-not-source-html-with-r