Rvest scraping errors

て烟熏妆下的殇ゞ submitted on 2019-12-04 18:38:51

There are two problems with your code. Look here for examples on how to use the package.

1. You cannot apply every function to every kind of object; each one has its own job.

  • html() downloads and parses the content of a page (rvest 0.3.0 deprecated it in favour of read_html())
  • html_node() selects a node from the downloaded content of a page
  • html_text() extracts the text from a previously selected node

Therefore, to download one of your pages and extract the text of the html-node, use this:

library(rvest)

old-school style:

url          <- "https://github.com/rails/rails/pull/100"
url_content  <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text

... or this ...

hard to read old-school style:

url_mainnode_text  <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text

... or this ...

magrittr piping style:

url_mainnode_text  <- 
  html("https://github.com/rails/rails/pull/100") %>%
  html_node("*") %>%
  html_text()
url_mainnode_text

2. When you have a list (or vector) of URLs, you have to apply the function to each element, e.g. with lapply().

If you want to batch-process several URLs, you can try something like this:

  url_list    <- c("https://github.com/rails/rails/pull/100", 
                   "https://github.com/rails/rails/pull/200", 
                   "https://github.com/rails/rails/pull/300")

  get_html_text <- function(url, css_or_xpath="*"){
      html_text(
        html_node(
          html(url), css_or_xpath   # use the url argument, not a hard-coded address
        )
      )
  }

lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
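When scraping a batch, a single page that cannot be read stops the whole lapply() call. Below is a minimal sketch of one way to guard against that. It assumes a current rvest, where read_html() takes the place of html(). The helper name safe_get_html_text and the inline HTML snippets are made up for illustration, so the example runs without network access.

```r
library(rvest)

# Hypothetical helper: returns NA instead of stopping when a source cannot be read.
safe_get_html_text <- function(src, css = "*") {
  tryCatch(
    html_text(html_node(read_html(src), css)),
    error = function(e) NA_character_
  )
}

# Stand-ins for real URLs: two literal HTML snippets and one path that does not exist.
sources <- c(
  "<html><body><a class='message'>first commit</a></body></html>",
  "<html><body><a class='message'>second commit</a></body></html>",
  "no-such-page.html"
)

# vapply() works like lapply() but returns a character vector of a known type.
texts <- vapply(sources, safe_get_html_text, character(1),
                css = "a.message", USE.NAMES = FALSE)
texts
```

With real pages you would pass url_list instead of sources. The NA entries then tell you which requests failed without losing the results that succeeded.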

You need to use html_nodes() (plural), which returns every matching node rather than only the first, and identify which CSS selectors match the data you're interested in. For example, to extract the usernames of the people discussing pull request 200:

rootUri <- "https://github.com/rails/rails/pull/200"
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()

[1] "jaw6"      "jaw6"      "josevalim"
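The difference between html_node() and html_nodes() is easiest to see on a small fragment. The sketch below is for illustration only: the HTML and the usernames alice and bob are invented, and it assumes a current rvest where read_html() takes the place of html().

```r
library(rvest)

# Made-up page fragment standing in for a downloaded discussion page.
page <- read_html(
  "<div id='discussion_bucket'>
     <strong><a>alice</a></strong>
     <strong><a>bob</a></strong>
   </div>"
)

# html_node() keeps only the first match ...
first_user <- page %>% html_node("#discussion_bucket strong a") %>% html_text()

# ... while html_nodes() keeps every match.
all_users <- page %>% html_nodes("#discussion_bucket strong a") %>% html_text()

first_user
all_users
```

If a scrape returns only one value where you expected several, a singular html_node() call is a likely cause.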