Question
I am scraping a website with URLs of the form http://domain.com/post/X, where X is a number from 1 to 5000. I can scrape a single page with rvest using this code:
library(rvest)

# read_html() replaces the deprecated html()
website <- read_html("http://www.domain.com/post/1")
Name <- website %>%
  html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
  html_text()
Speciality <- website %>%
  html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
  html_text()
I need the code to fetch every page from the website and put the scraped data in a table, with one row per page. Please help.
Answer 1:
I would wrap your code for scraping a single page in an lapply() call, and then use rbindlist() from the data.table package to combine the information from each page into one table.
This is hard to test without an actual example, but try something like this:
library(rvest)
library(data.table)

scrapeDomain <- function(baseURL = "http://www.domain.com/post", index = 1:10) {
  scrape1 <- lapply(index, function(n) {
    # Build the URL for post n and parse the page
    # (read_html() replaces the deprecated html())
    url <- paste(baseURL, n, sep = "/")
    website <- read_html(url)
    name <- website %>%
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > h1") %>%
      html_text()
    speciality <- website %>%
      html_node("body > div > div.row-fluid > div > div.DrFullDetails > div.MainDetails > div.Description > p.JobTitle") %>%
      html_text()
    # One row per page; store the URL rather than the parsed document
    data.table(url = url, name = name, speciality = speciality)
  })
  # Stack the per-page rows into a single table
  rbindlist(scrape1)
}

scrapeDomain()
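
Some post numbers in a range like 1:5000 may not exist, and a failed download would otherwise abort the whole loop. Here is a minimal sketch of one way to skip such pages with tryCatch(). The names extract_row, scrape_safely and fake_fetch are made up for illustration, the shortened selectors are an assumption, and the pages are generated in memory so the example can run without the real site:

```r
library(rvest)
library(data.table)

# Extract one row from an already-parsed page.
# Shortened selectors are an assumption; the full selectors above work the same way.
extract_row <- function(doc, n) {
  data.table(
    post = n,
    name = doc %>% html_node("div.Description > h1") %>% html_text(),
    speciality = doc %>% html_node("div.Description > p.JobTitle") %>% html_text()
  )
}

# Loop over the indices; a page that fails to load yields NULL
# instead of stopping the loop, and NULL entries are dropped before binding.
scrape_safely <- function(index, fetch_page) {
  rows <- lapply(index, function(n) {
    tryCatch(extract_row(fetch_page(n), n), error = function(e) NULL)
  })
  rbindlist(Filter(Negate(is.null), rows))
}

# Hypothetical offline fetcher: post 2 raises an error, as a missing page would.
fake_fetch <- function(n) {
  if (n == 2) stop("HTTP error 404")
  read_html(sprintf(
    '<html><body><div class="Description"><h1>Name %d</h1><p class="JobTitle">Speciality %d</p></div></body></html>',
    n, n
  ))
}

result <- scrape_safely(1:3, fake_fetch)
print(result)
```

For the real site, fetch_page could be something like function(n) read_html(paste("http://www.domain.com/post", n, sep = "/")). Pages that load but lack a matching node still produce a row, with NA in the missing column.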
Source: https://stackoverflow.com/questions/27312728/r-and-web-scraping-with-looping