Question
As the title states, I'm curious whether it is possible for the html_text() function from the rvest package to store an NA value if it is not able to find an attribute on a specific page.
I'm currently running a scrape over 199 pages (which works fine; tested on a few variables already).
Currently, when I search for a value that is present on only some (136) of the 199 pages, html_text() returns a vector of only 136 strings. This is not useful, because without NAs I am unable to determine which pages contained the variable in question.
I see that html_attr() is able to receive a default input, but html_text() is not. Any tips?
Thank you so much!
Answer 1:
If you create a new function to wrap error handling, it'll keep the %>% pipe cleaner and easier to grok for your future self and others:
library(rvest)
# Return NA when html_text() errors or finds nothing
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...))
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA)
  }
  txt
}
base_url <- "http://www.saem.org/membership/services/residency-directory?RecordID=%d"
record_id <- c(1291, 1000, 1166, 1232, 999)
sapply(record_id, function(i) {
  read_html(sprintf(base_url, i)) %>%  # html() in older rvest releases
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text_na() %>%
    as.numeric()
})
## [1] 8 NA 10 27 NA
Also, by doing an sapply over the vector of record_ids you automagically get a vector back of whatever value it is you're trying to extract.
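A related option, sketched here as an alternative to the wrapper rather than as part of it: the singular html_node() (named html_element() in newer rvest releases) returns a "missing" node when nothing matches, and html_text() turns that into NA by itself. The two inline pages and the shortened selector below are made up for illustration, so the example runs without hitting the real site.

```r
library(rvest)

# Two hypothetical pages as inline HTML: only the first has the target cell.
pages <- c(
  with_value    = "<html><body><table id='drpict'><tr><td class='text'>8</td></tr></table></body></html>",
  without_value = "<html><body><p>No table on this page</p></body></html>"
)

positions <- sapply(pages, function(page) {
  read_html(page) %>%
    html_node("#drpict .text") %>%  # singular: exactly one result per page
    html_text() %>%                 # a missing node becomes NA
    as.numeric()
})

print(positions)
```

This only fits when you expect at most one match per page; html_nodes() is still what you want if a page can hold several values.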
Answer 2:
Figured it out.
I just needed to add a line of logic to my loop.
Here's a chunk of the code that worked:
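all_data <- c()  # start empty so the loop can append to it
for (i in record_id) {
  site <- paste("http://www.saem.org/membership/services/residency-directory?RecordID=", i, sep = "")
  site <- read_html(site)  # html() in older rvest releases
  this_data <- site %>%
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text() %>%
    as.numeric()
  # No match on this page: record NA so positions stay aligned with record_id
  if (length(this_data) == 0) {
    this_data <- NA
  }
  all_data <- c(all_data, this_data)
}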
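The same length check can also be written without growing all_data inside a loop. This is a minimal sketch using only base R: page_results stands in for the per-page scrape results (numeric(0) plays the role of a page where the selector matched nothing), and value_or_na is a hypothetical helper name. It assumes at most one value per page is wanted, so it keeps only the first.

```r
# Simulated per-page results standing in for the scraped values.
page_results <- list(8, numeric(0), 10, 27, numeric(0))

# Hypothetical helper: collapse a zero-length result to NA,
# otherwise keep the first value found on the page.
value_or_na <- function(x) if (length(x) == 0) NA_real_ else x[[1]]

# vapply() guarantees one numeric value per page, so the result
# always lines up with the pages it came from.
all_data <- vapply(page_results, value_or_na, numeric(1))

print(all_data)
```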
Thanks anyway everybody (and @hrbrmstr)! :)
Source: https://stackoverflow.com/questions/30721519/rvest-package-is-it-possible-for-html-text-to-store-an-na-value-if-it-does-n