Submit URLs from a data frame column using rvest

Posted by 試著忘記壹切 on 2019-12-24 13:41:23

Question


I have a data frame called dogs that looks like this:

url 
https://en.wikipedia.org/wiki/Dog
https://en.wikipedia.org/wiki/Dingo
https://en.wikipedia.org/wiki/Canis_lupus_dingo

I would like to submit all the URLs to rvest, but I am not sure how to do it.

I tried this:

dogstext <-html(dogs$url) %>%
    html_nodes("p:nth-child(4)") %>%
    html_text() 

but I got this error:

Error in UseMethod("parse") : 
  no applicable method for 'parse' applied to an object of class "factor"

Answer 1:


As the error says, you need to convert the factor column to character before parsing:

dogs$url<-as.character(dogs$url)

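The rest of your code can follow after this, but html() expects a single URL, so you still need to loop over the column, as in the update below.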
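
To make the factor issue concrete, here is a minimal sketch in base R. It assumes the data frame was built with strings stored as factors, which was the data.frame() default before R 4.0.0; stringsAsFactors = TRUE is passed explicitly so the sketch behaves the same on newer versions. The names urls, dogs_factor and dogs_char are made up for illustration.

```r
urls <- c("https://en.wikipedia.org/wiki/Dog",
          "https://en.wikipedia.org/wiki/Dingo",
          "https://en.wikipedia.org/wiki/Canis_lupus_dingo")

# Reproduce the situation from the question: the column is a factor
dogs_factor <- data.frame(url = urls, stringsAsFactors = TRUE)
class(dogs_factor$url)   # "factor"

# Option 1: convert the existing column in place
dogs_factor$url <- as.character(dogs_factor$url)

# Option 2: never create the factor in the first place
dogs_char <- data.frame(url = urls, stringsAsFactors = FALSE)

# Both routes end with the same character vector
identical(dogs_factor$url, dogs_char$url)
```

Either option gives a plain character column whose elements can be passed to the parser one at a time.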

Update:

dog <- data.frame(url = c("https://en.wikipedia.org/wiki/Dog",
                          "https://en.wikipedia.org/wiki/Dingo",
                          "https://en.wikipedia.org/wiki/Canis_lupus_dingo"))
> str(dog)
'data.frame':   3 obs. of  1 variable:
 $ url: Factor w/ 3 levels "https://en.wikipedia.org/wiki/Canis_lupus_dingo",..: 3 2 1
> lapply(as.character(dog$url), function(i) html(i) %>%
          html_nodes("p:nth-child(4)") %>%
          html_text())
[[1]]
[1] "The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated canid which has been selectively bred for millennia for various behaviors, sensory capabilities, and physical attributes.[2] The global dog population is estimated to between 700 million[3] to over one billion, thus making the dog the most abundant member of order Carnivora.[4]"

[[2]]
[1] "The dingo's habitat ranges from deserts to grasslands and the edges of forests. Dingoes will normally make their dens in deserted rabbit holes and hollow logs close to an essential supply of water."

[[3]]
character(0)
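
The third result is character(0) because that page has no paragraph matching the selector. The sketch below shows the same effect offline. It is an illustration under assumptions, not the code from the answer: the two HTML strings in pages are hypothetical stand-ins for downloaded pages, and it uses read_html(), which replaced html() in later rvest releases.

```r
library(rvest)

# Hypothetical stand-ins for downloaded pages, so the sketch needs no network
pages <- c(
  with_match    = "<html><body><div><h1>t</h1><p>a</p><p>b</p><p>fourth child</p></div></body></html>",
  without_match = "<html><body><div><p>only one paragraph</p></div></body></html>"
)

# Parse each document separately and apply the selector from the question
texts <- lapply(pages, function(src) {
  read_html(src) %>%
    html_nodes("p:nth-child(4)") %>%
    html_text()
})

# One element has a match, the other comes back as a zero-length vector
lengths(texts)
```

With real URLs the same loop applies; only the elements of pages would be addresses instead of literal HTML.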



Answer 2:


You can also stick with the piping (%>%) idiom all the way through and, if needed, append a column with the extracted text to the original data frame or keep it as a separate vector. The method below also makes the code a bit more readable.

library(rvest)
library(dplyr)

dog <- data.frame(url=c("https://en.wikipedia.org/wiki/Dog",
                        "https://en.wikipedia.org/wiki/Dingo",
                        "https://en.wikipedia.org/wiki/Canis_lupus_dingo"))

# this keeps the code clean and readable and testable

extract <- function(x, css) {

  # this catches retrieval errors

  pg <- try(html(x), silent=TRUE)

  # if any retrieval error, return NA

  if (inherits(pg, "try-error")) { return(NA) }

  pg %>% 
    html_nodes(css) %>%
    html_text -> element

  # if there is no matching element the result will be a zero-length
  # character vector, which will prevent sapply from simplifying its
  # output, so return NA in that case; otherwise keep the first match

  element <- if (length(element) == 0) NA else element[1]

  element

}

# add as a column to the original data frame

dog %>% mutate(text=sapply(as.character(url), extract, "p:nth-child(4)")) -> dog

glimpse(dog)

## Observations: 3
## Variables:
## $ url  (fctr) https://en.wikipedia.org/wiki/Dog, https://en.wikipedia....
## $ text (chr) "The domestic dog (Canis lupus familiaris or Canis famili...

# or just get it out as a separate vector

dog$url %>%
  as.character %>%
  sapply(extract, "p:nth-child(4)")

##                                                                                                                                                                                                                                                                                                                                        https://en.wikipedia.org/wiki/Dog 
## "The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated canid which has been selectively bred for millennia for various behaviors, sensory capabilities, and physical attributes.[2] The global dog population is estimated to between 700 million[3] to over one billion, thus making the dog the most abundant member of order Carnivora.[4]" 
##                                                                                                                                                                                                                                                                                                                                      https://en.wikipedia.org/wiki/Dingo 
##                                                                                                                                                                  "The dingo's habitat ranges from deserts to grasslands and the edges of forests. Dingoes will normally make their dens in deserted rabbit holes and hollow logs close to an essential supply of water." 
##                                                                                                                                                                                                                                                                                                                          https://en.wikipedia.org/wiki/Canis_lupus_dingo 
##                                                                                                                                                                                                                                                                                                                                                                       NA
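
The reason for replacing empty results with NA can be checked without any scraping. This is a small base R sketch; the list found is made-up data standing in for the per-URL results, and patched applies the same zero-length test as the extract function.

```r
# Made-up per-URL results: two matches and one empty result
found <- list(a = "text one", b = "text two", c = character(0))

# Replace zero-length results with NA, keeping the first match otherwise
patched <- lapply(found, function(el) if (length(el) == 0) NA else el[1])

# With unequal lengths sapply cannot simplify and returns a list
raw_result <- sapply(found, identity)

# With every element of length one it simplifies to a named vector
patched_result <- sapply(patched, identity)

is.list(raw_result)
is.character(patched_result)
```

A plain vector is what mutate() needs in order to store the text as an ordinary column, which is why the length check sits inside the function.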


Source: https://stackoverflow.com/questions/30586480/submit-urls-from-a-data-frame-column-using-rvest
