rvest not recognizing CSS selector

Question

I'm trying to scrape this website:

http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true

through the rvest package in R.

Unfortunately, it seems that rvest doesn't find the nodes through the CSS selector.

For example, to extract the information in the header of every table (Grade, Prize, Distance), whose CSS selector is ".black", I run this code:

library(rvest)

URL <- read_html("http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true")
nodes <- html_nodes(URL, ".black")

nodes comes back as an empty list, so nothing is being scraped.

Answer 1:

The page makes an XHR request to generate the HTML, so the data is not in the initial response. Try requesting that endpoint directly (which should also make it easier to automate the data capture):

library(httr)
library(xml2)
library(rvest)

res <- GET("http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
           query=list(r_date="2015-12-26",
                      meeting_id=18))

doc <- read_html(content(res, as="text"))

html_nodes(doc, ".black")
## {xml_nodeset (56)}
##  [1] <span class="black">A9</span>
##  [2] <span class="black">£61</span>
##  [3] <span class="black">470m</span>
##  [4] <span class="black">-30</span>
##  [5] <span class="black">H2</span>
##  [6] <span class="black">£105</span>
##  [7] <span class="black">470m</span>
##  [8] <span class="black">-30</span>
##  [9] <span class="black">A7</span>
## [10] <span class="black">£61</span>
## [11] <span class="black">470m</span>
## [12] <span class="black">-30</span>
## [13] <span class="black">A5</span>
## [14] <span class="black">£66</span>
## [15] <span class="black">470m</span>
## [16] <span class="black">-30</span>
## [17] <span class="black">A8</span>
## [18] <span class="black">£61</span>
## [19] <span class="black">470m</span>
## [20] <span class="black">-20</span>
## ...
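
The output above appears to repeat in groups of four spans per race, so the values can be reshaped into one row per race. The following is only a sketch under that assumption: the inline HTML is a stand-in for the fetched page, the column names are my own labels, and the thread does not say what the fourth value in each group means.

```r
library(xml2)
library(rvest)

# Inline stand-in for the fetched document (assumption: the real page
# wraps these spans in more markup; only the .black spans matter here).
html <- '<html><body>
  <span class="black">A9</span><span class="black">&#163;61</span>
  <span class="black">470m</span><span class="black">-30</span>
  <span class="black">H2</span><span class="black">&#163;105</span>
  <span class="black">470m</span><span class="black">-30</span>
</body></html>'
doc <- read_html(html)

# Pull the text out of every matching node.
values <- html_text(html_nodes(doc, ".black"))

# Assumption: exactly four spans per race, always in the same order.
races <- as.data.frame(matrix(values, ncol = 4, byrow = TRUE),
                       stringsAsFactors = FALSE)
names(races) <- c("grade", "prize", "distance", "fourth_value")

print(races)
```

If a race ever has a different number of ".black" spans, the rows will be misaligned, so it is worth checking that length(values) is a multiple of four before reshaping.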

Answer 2:

Your selector is good and rvest is working just fine. The problem is that what you are looking for is not in the document that read_html() fetched.

If you open that website and use the web browser's inspecting tool, you will see that all the data you want is a descendant of <div id="resultMainOutput">. Now if you look at the source code of the page, you will see this (line breaks added for readability):

<div id="resultMainOutput">
    <div class="wait">
       <img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
    </div>
</div>

The data you want is loaded dynamically, and rvest cannot cope with that. It can only fetch the page source and retrieve whatever is there, without any client-side processing.
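
To make this concrete, here is a small sketch that parses the placeholder markup shown above from an inline string (wrapped in html and body tags for the example, so no network access is needed) and runs the same selector against it:

```r
library(xml2)
library(rvest)

# The static source as served: only the "loading" placeholder is present.
static_html <- '<html><body>
<div id="resultMainOutput">
    <div class="wait">
       <img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
    </div>
</div>
</body></html>'
doc <- read_html(static_html)

# The selector is valid, but nothing in the static source matches it.
black <- html_nodes(doc, ".black")
length(black)
## [1] 0

# The placeholder itself is found without any trouble.
wait <- html_nodes(doc, "#resultMainOutput .wait")
length(wait)
## [1] 1
```

An empty node set from a valid selector is therefore a hint that the content is added later by JavaScript rather than a sign that the selector is wrong.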

The exact same issue was brought up in the blog post introducing rvest, and here is what the package author had to say:

You have two options for pages like that:

Use the debug console in the web browser to reverse engineer the communications protocol and request the raw data directly from the server.

Use a package like RSelenium to automate a web browser.

If you don't need to obtain that data repeatedly, or you can accept a bit of manual work in every analysis, the easiest workaround is:

1. Open the website in your web browser of choice.
2. Using the web browser's inspecting tool, copy the current page content (the entire page or only the <div id="resultMainOutput"> content).
3. Paste it into a text editor and save it as a new file.
4. Run the analysis on that file.

> url <- read_html("/tmp/racingpost.html")
> html_nodes(url, ".black")
# {xml_nodeset (56)}
# [1] <span class="black">A9</span>
# [2] <span class="black">Â£61</span>
# [3] <span class="black">470m</span>
# [4] <span class="black">-30</span>
# (skip the rest)

As you can see, there are some encoding issues along the way (the pound sign comes out as "Â£"), but they can be solved later on.
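
One possible way to handle the encoding problem, assuming the saved file is UTF-8 but does not declare its charset, is to pass the encoding explicitly to read_html(). The temporary file below is a hypothetical stand-in for the saved page, not the real one:

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the saved page: UTF-8 bytes, no charset declared.
html <- "<html><body><span class=\"black\">\u00a361</span></body></html>"
path <- tempfile(fileext = ".html")
writeBin(charToRaw(enc2utf8(html)), path)

# Tell the parser how the bytes are encoded instead of letting it guess.
doc <- read_html(path, encoding = "UTF-8")
prize <- html_text(html_nodes(doc, ".black"))
print(prize)
```

If the file was saved in a different encoding, the encoding argument would need to match that instead.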

Source: https://stackoverflow.com/questions/34473847/rvest-not-recognizing-css-selector

Tags

web-scraping

rvest