I'm trying to scrape this website through the rvest package in R. Unfortunately, it seems that rvest doesn't recognize the nodes through the CSS selector.
For example, if I try to extract the information in the header of every table (Grade, Prize, Distance), whose CSS selector is ".black", and run this code:
library(rvest)

URL <- read_html("http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true")
nodes <- html_nodes(URL, ".black")
nodes comes out as an empty list, so it isn't scraping anything.
The page makes an XHR request to generate the HTML. Try this instead (which should also make it easier to automate the data capture):
library(httr)
library(xml2)
library(rvest)
res <- GET("http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
           query = list(r_date = "2015-12-26",
                        meeting_id = 18))
doc <- read_html(content(res, as="text"))
html_nodes(doc, ".black")
## {xml_nodeset (56)}
## [1] <span class="black">A9</span>
## [2] <span class="black">£61</span>
## [3] <span class="black">470m</span>
## [4] <span class="black">-30</span>
## [5] <span class="black">H2</span>
## [6] <span class="black">£105</span>
## [7] <span class="black">470m</span>
## [8] <span class="black">-30</span>
## [9] <span class="black">A7</span>
## [10] <span class="black">£61</span>
## [11] <span class="black">470m</span>
## [12] <span class="black">-30</span>
## [13] <span class="black">A5</span>
## [14] <span class="black">£66</span>
## [15] <span class="black">470m</span>
## [16] <span class="black">-30</span>
## [17] <span class="black">A8</span>
## [18] <span class="black">£61</span>
## [19] <span class="black">470m</span>
## [20] <span class="black">-20</span>
## ...
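If you then want those header values in tabular form, here is a minimal sketch; it assumes the ".black" spans repeat in groups of four (grade, prize, distance, and a going allowance), which matches the output above but is worth verifying against the page:
vals <- html_text(html_nodes(doc, ".black"))

# Fold the repeating spans into rows of four columns; the column names
# are assumptions based on the values seen above.
headers <- as.data.frame(matrix(vals, ncol = 4, byrow = TRUE),
                         stringsAsFactors = FALSE)
names(headers) <- c("grade", "prize", "distance", "going")
head(headers)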
Your selector is good and rvest is working just fine. The problem is that what you are looking for is not in the URL object.
If you open that website and use your browser's inspection tool, you will see that all the data you want is a descendant of <div id="resultMainOutput">. Now if you look at the source code of the website, you will see this (line breaks added for readability):
<div id="resultMainOutput">
<div class="wait">
<img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
</div>
</div>
The data you want is loaded dynamically, and rvest cannot cope with that. It can only fetch the website's source code and retrieve whatever is there, without any client-side processing.
The exact same issue was brought up in the blog post introducing rvest, and here is what the package author had to say:
You have two options for pages like that:
Use the debug console in the web browser to reverse engineer the communications protocol and request the raw data directly from the server.
Use a package like RSelenium to automate a web browser.
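For the second option, here is a minimal sketch using RSelenium; it assumes a recent version of the package (with rsDriver()) and a working Firefox installation, so treat it as a starting point rather than a drop-in solution:
library(RSelenium)
library(rvest)

# Start a Selenium-driven browser and load the page so the
# client-side JavaScript can run.
rD <- rsDriver(browser = "firefox")
remDr <- rD$client
remDr$navigate("http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true")
Sys.sleep(5)  # give the XHR time to populate #resultMainOutput

# Grab the rendered HTML and hand it to rvest as usual.
doc <- read_html(remDr$getPageSource()[[1]])
html_nodes(doc, ".black")

remDr$close()
rD$server$stop()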
If you don't need to obtain that data repeatedly, or you can accept a bit of manual work in each analysis, the easiest workaround is:
- Open the website in the web browser of your choice
- Using the browser's inspection tool, copy the current website content (the entire page or only the <div id="resultMainOutput"> content)
- Paste that into a text editor and save it as a new file
- Run the analysis on that file
> url <- read_html("/tmp/racingpost.html")
> html_nodes(url, ".black")
# {xml_nodeset (56)}
# [1] <span class="black">A9</span>
# [2] <span class="black">£61</span>
# [3] <span class="black">470m</span>
# [4] <span class="black">-30</span>
# (skip the rest)
As you can see, there are some encoding issues along the way, but they can be solved later on.
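If the £ signs (or other non-ASCII characters) come out mangled when reading the saved file, one way to handle it is to declare the encoding explicitly when parsing; this sketch assumes the file was saved as UTF-8:
# Tell read_html which encoding the saved file uses so characters
# like the pound sign are parsed correctly.
url <- read_html("/tmp/racingpost.html", encoding = "UTF-8")
html_text(html_nodes(url, ".black"))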
Source: https://stackoverflow.com/questions/34473847/rvest-not-recognizing-css-selector