R and Web Scraping with looping

问题

I am scraping a website with urls http://domain.com/post/X , where X is a number stating from 1:5000 I can scrap using rvest using this code:

website <- html("http://www.domain.com/post/1")

Name <- website%>% 
  html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
  html_text()

Speciality <- website %>% 
  html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
  html_text()

I need the code to grab all the pages from the website and put the scarped data in a table with every page in a new row. Please Help

回答1:

I would wrap your code for scraping a single page in an lapply, and then use rbindlist from the data.table package to combine the information from each page.

This is hard to test without an actual example, but try something like this:

library(rvest)
library(data.table)

scrapeDomain <- function(baseURL="http://www.domain.com/post", index=1:10) {

  scrape1 <- lapply(index, function(n) {

    website <- paste(baseURL, n, sep="/") %>%
      html()

    name <- website %>% 
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
      html_text()

    speciality <- website %>% 
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
      html_text()

    data.table(website=website, name=name, specialty=specialty)

  } )

  rbindlist(scrape1)

}

scrapeDomain()

来源：https://stackoverflow.com/questions/27312728/r-and-web-scraping-with-looping

标签

web-scraping

rvest