Question
I am scraping a website with URLs of the form http://domain.com/post/X, where X is a number ranging from 1 to 5000.
I can scrape a single page with rvest using this code:
website <- html("http://www.domain.com/post/1")
Name <- website%>%
html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
html_text()
Speciality <- website %>%
html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
html_text()
I need the code to fetch all the pages from the website and put the scraped data in a table, with one row per page. Please help.
Answer 1:
I would wrap your code for scraping a single page in an lapply, and then use rbindlist from the data.table package to combine the information from each page.
This is hard to test without an actual example, but try something like this:
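library(rvest)
library(data.table)

# Scrape each page in `index` and return one row per page.
scrapeDomain <- function(baseURL = "http://www.domain.com/post", index = 1:10) {
  scrape1 <- lapply(index, function(n) {
    # Build the page URL, e.g. http://www.domain.com/post/1
    url <- paste(baseURL, n, sep = "/")
    # read_html() replaces html(), which is deprecated in newer rvest releases
    website <- read_html(url)
    name <- website %>%
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
      html_text()
    speciality <- website %>%
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
      html_text()
    # Store the URL string; the parsed document itself is not useful as a table column
    data.table(website = url, name = name, speciality = speciality)
  })
  # Stack the one-row tables into a single table
  rbindlist(scrape1)
}

scrapeDomain()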
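With 5000 pages, some requests will probably fail (a missing post, a timeout), and a single error inside lapply stops the whole run. Here is a rough sketch of one way to skip failed pages with tryCatch. The names scrapeSafely and fakeScrape are made up for illustration, and fakeScrape is only an offline stand-in for a real single-page scraper. It uses base R data frames so it runs without extra packages.

```r
# Hypothetical helper (not part of rvest or data.table): apply scrape_one to
# each page number and skip pages whose scrape raises an error.
scrapeSafely <- function(index, scrape_one) {
  rows <- lapply(index, function(n) {
    tryCatch(
      scrape_one(n),
      error = function(e) {
        message("Skipping page ", n, ": ", conditionMessage(e))
        NULL  # mark this page as failed
      }
    )
  })
  # Drop the failed pages, then stack the remaining one-row data frames
  rows <- Filter(Negate(is.null), rows)
  do.call(rbind, rows)
}

# Offline stand-in for a real single-page scraper; page 3 simulates a missing post
fakeScrape <- function(n) {
  if (n == 3) stop("HTTP 404")
  data.frame(page = n, name = paste("Doctor", n), stringsAsFactors = FALSE)
}

result <- scrapeSafely(1:5, fakeScrape)
print(result)
```

In real use you would pass a function that does the rvest scraping for one page number instead of fakeScrape. If you prefer data.table, rbindlist should work in place of do.call(rbind, rows).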
Source: https://stackoverflow.com/questions/27312728/r-and-web-scraping-with-looping