R and Web Scraping with looping

Submitted by 纵饮孤独 on 2020-12-27 07:15:55

Question


I am scraping a website with URLs of the form http://domain.com/post/X, where X is a number ranging from 1 to 5000. I can scrape a single page with rvest using this code:

# read_html() replaces html(), which is deprecated in newer rvest versions
website <- read_html("http://www.domain.com/post/1")

Name <- website%>% 
  html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
  html_text()

Speciality <- website %>% 
  html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
  html_text()

I need the code to grab all the pages from the website and put the scraped data in a table, with each page in a new row. Please help.


Answer 1:


I would wrap your code for scraping a single page in an lapply, and then use rbindlist from the data.table package to combine the information from each page.
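
To see the lapply plus rbindlist pattern on its own, here is a minimal sketch with made-up data and no network access. The function name make_row and the column names are hypothetical, chosen only for illustration.

```r
library(data.table)

# Hypothetical stand-in for "scrape one page": returns a one-row table.
make_row <- function(n) {
  data.table(page = n, name = paste0("name_", n))
}

# One list element per page ...
rows <- lapply(1:3, make_row)

# ... stacked into a single table with one row per page.
combined <- rbindlist(rows)
print(combined)
```

Each call returns a one-row table, so the combined result has exactly one row per page, which is the layout the question asks for.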

This is hard to test without an actual example, but try something like this:

library(rvest)
library(data.table)

scrapeDomain <- function(baseURL="http://www.domain.com/post", index=1:10) {

  scrape1 <- lapply(index, function(n) {

    # Build the page URL, then parse it (read_html() replaces the deprecated html())
    url <- paste(baseURL, n, sep="/")
    website <- read_html(url)

    name <- website %>%
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
      html_text()

    speciality <- website %>%
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
      html_text()

    # One row per page; store the URL string rather than the parsed document
    data.table(website=url, name=name, speciality=speciality)

  } )

  rbindlist(scrape1)

}

scrapeDomain()
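
With thousands of pages, a single missing or failing page would stop the whole lapply. Below is a minimal sketch of one way to guard against that, assuming a failed page should simply become a row of NA values. The names safe_scrape, fetch_fun and fake_fetch are hypothetical, and the fake fetcher stands in for the real rvest calls so the example runs offline.

```r
library(data.table)

# Hypothetical wrapper: call fetch_fun(n); on error, return an NA row
# for that page instead of aborting the whole loop.
safe_scrape <- function(n, fetch_fun) {
  tryCatch(
    fetch_fun(n),
    error = function(e) {
      data.table(page = n, name = NA_character_, speciality = NA_character_)
    }
  )
}

# Fake fetcher for illustration only: page 2 "fails" the way a missing page might.
fake_fetch <- function(n) {
  if (n == 2) stop("page not found")
  data.table(page = n, name = paste0("Dr ", n), speciality = "General")
}

# Scrape pages 1 to 3; the failing page yields NA values, the others real rows.
result <- rbindlist(lapply(1:3, safe_scrape, fetch_fun = fake_fetch))
print(result)
```

In a real run, fetch_fun would be the function that calls read_html and html_node for one page. Pausing between requests with Sys.sleep is also worth considering when looping over this many pages.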


Source: https://stackoverflow.com/questions/27312728/r-and-web-scraping-with-looping
