Question
I have a data frame called dogs that looks like this:
url
https://en.wikipedia.org/wiki/Dog
https://en.wikipedia.org/wiki/Dingo
https://en.wikipedia.org/wiki/Canis_lupus_dingo
I would like to submit all the URLs to rvest, but I am not sure how. I tried this:
dogstext <-html(dogs$url) %>%
html_nodes("p:nth-child(4)") %>%
html_text()
but I got this error:
Error in UseMethod("parse") :
no applicable method for 'parse' applied to an object of class "factor"
Answer 1:
As the error says, you need to convert the factor column to character before parsing:
dogs$url<-as.character(dogs$url)
After that, run your code on each URL; html() reads one page per call, so loop over the column as in the update below.
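A small offline illustration of why the conversion matters. This is only a sketch: the dogs object is reconstructed from the question, and stringsAsFactors = TRUE is an assumption that mimics the data.frame default before R 4.0.0, which is the likely source of the factor column.

```r
# Reconstruct the question's data frame; stringsAsFactors = TRUE mimics
# the pre-R-4.0 default that turns strings into a factor column.
dogs <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Dog",
          "https://en.wikipedia.org/wiki/Dingo",
          "https://en.wikipedia.org/wiki/Canis_lupus_dingo"),
  stringsAsFactors = TRUE
)

# The column is a factor: integer codes plus a set of levels.
url_class_before <- class(dogs$url)

# Convert to plain character strings before handing the URLs to rvest.
dogs$url <- as.character(dogs$url)
url_class_after <- class(dogs$url)

print(url_class_before)
print(url_class_after)
```

On R 4.0.0 or later, data.frame() no longer creates factors by default, so the conversion is only needed when the column really is a factor.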
Update:
dog<-data.frame(url=c("https://en.wikipedia.org/wiki/Dog","https://en.wikipedia.org/wiki/Dingo","https://en.wikipedia.org/wiki/Canis_lupus_dingo"))
> str(dog)
'data.frame': 3 obs. of 1 variable:
$ url: Factor w/ 3 levels "https://en.wikipedia.org/wiki/Canis_lupus_dingo",..: 3 2 1
> lapply(as.character(dog$url),function(i)dogstext <-html(i) %>%
html_nodes("p:nth-child(4)") %>%
html_text() )
[[1]]
[1] "The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated canid which has been selectively bred for millennia for various behaviors, sensory capabilities, and physical attributes.[2] The global dog population is estimated to between 700 million[3] to over one billion, thus making the dog the most abundant member of order Carnivora.[4]"
[[2]]
[1] "The dingo's habitat ranges from deserts to grasslands and the edges of forests. Dingoes will normally make their dens in deserted rabbit holes and hollow logs close to an essential supply of water."
[[3]]
character(0)
Answer 2:
You can also keep with the piping (%>%) idiom all the way through and, if needed, append a column with the extracted text to the original data frame or keep it as a vector. The method below also makes the code a bit more readable.
library(rvest)
library(dplyr)
dog <- data.frame(url=c("https://en.wikipedia.org/wiki/Dog",
"https://en.wikipedia.org/wiki/Dingo",
"https://en.wikipedia.org/wiki/Canis_lupus_dingo"))
# a small helper keeps the code clean, readable and testable
extract <- function(x, css) {
  # this catches retrieval errors
  # (html() is called read_html() in newer rvest releases)
  pg <- try(html(x), silent=TRUE)
  # if any retrieval error, return NA
  if (inherits(pg, "try-error")) { return(NA) }
  pg %>%
    html_nodes(css) %>%
    html_text -> element
  # if there is no matching element the result will be a zero-length vector,
  # which will prevent sapply from simplifying it, so test for that here
  # (ifelse with a length-one test also keeps only the first match)
  element <- ifelse(length(element) == 0, NA, element)
  element
}
# add as a column to the original data frame
dog %>% mutate(text=sapply(as.character(url), extract, "p:nth-child(4)")) -> dog
glimpse(dog)
## Observations: 3
## Variables:
## $ url (fctr) https://en.wikipedia.org/wiki/Dog, https://en.wikipedia....
## $ text (chr) "The domestic dog (Canis lupus familiaris or Canis famili...
# or just get it out as a separate vector
dog$url %>%
as.character %>%
sapply(extract, "p:nth-child(4)")
## https://en.wikipedia.org/wiki/Dog
## "The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated canid which has been selectively bred for millennia for various behaviors, sensory capabilities, and physical attributes.[2] The global dog population is estimated to between 700 million[3] to over one billion, thus making the dog the most abundant member of order Carnivora.[4]"
## https://en.wikipedia.org/wiki/Dingo
## "The dingo's habitat ranges from deserts to grasslands and the edges of forests. Dingoes will normally make their dens in deserted rabbit holes and hollow logs close to an essential supply of water."
## https://en.wikipedia.org/wiki/Canis_lupus_dingo
## NA
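The comment in extract() about sapply can be checked without any network access. The sketch below uses made-up strings as stand-ins for html_text() results, and first_or_na is a hypothetical helper written for this illustration, not part of rvest or of the answer above.

```r
# Stand-ins for html_text() results: two pages with a match, one without.
found <- list("first paragraph", "second paragraph", character(0))

# Without a guard, the zero-length element stops sapply() from simplifying,
# so the result stays a list instead of becoming a character vector.
raw <- sapply(found, identity)
print(is.list(raw))

# Guard in the spirit of extract(): NA when nothing matched,
# otherwise the first matching element.
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[1]

guarded <- sapply(found, first_or_na)
print(guarded)
```

Because every call now returns exactly one value, sapply can simplify the result to a character vector, which is what allows it to be added as a data frame column with mutate.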
Source: https://stackoverflow.com/questions/30586480/submit-urls-from-a-data-frame-column-using-rvest