R web scraping across multiple pages

再見小時候 2020-12-01 05:50

I am working on a web scraping program to search for specific wines and return a list of local wines of that variety. The problem I am having is multiple page results. The c

2 answers
  • 2020-12-01 06:17

    You can do something similar with purrr::map_df() as well if you want all the info as a data.frame:

    library(rvest)
    library(purrr)
    
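    # the space in the query is encoded as %20; it is written %%20 here
    # because sprintf() treats a single % as the start of a format spec
    url_base <- "http://www.winemag.com/?s=washington%%20merlot&drink_type=wine&page=%d"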
    
    map_df(1:39, function(i) {
    
      # simple but effective progress indicator
      cat(".")
    
      pg <- read_html(sprintf(url_base, i))
    
      data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
                 excerpt=html_text(html_nodes(pg, "div.excerpt")),
                 rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
                 appellation=html_text(html_nodes(pg, "span.appellation")),
                 price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
                 stringsAsFactors=FALSE)
    
    }) -> wines
    
    dplyr::glimpse(wines)
    ## Observations: 1,170
    ## Variables: 5
    ## $ wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
    ## $ excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
    ## $ rating      (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
    ## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
    ## $ price       (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
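
    Note that every column comes back as character, as the glimpse() output shows. If you want to sort or filter by score or price, the columns need converting first. A minimal sketch of one way to do that follows; the three-row wines_demo data frame and the to_number() helper are made-up stand-ins for illustration (not part of rvest or purrr), and turning non-numeric entries into NA is an assumption about how you would want them handled.

    ```r
    # Hypothetical stand-in for the scraped `wines` data frame (values invented)
    wines_demo <- data.frame(
      wine   = c("Wine A", "Wine B", "Wine C"),
      rating = c("96", "95", "93"),
      price  = c("140", "70", "N/A"),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: drop everything except digits and the decimal point,
    # then convert; entries with nothing left become NA
    to_number <- function(x) {
      cleaned <- gsub("[^0-9.]", "", x)
      cleaned[cleaned == ""] <- NA_character_
      as.numeric(cleaned)
    }

    wines_demo$rating <- to_number(wines_demo$rating)
    wines_demo$price  <- to_number(wines_demo$price)

    str(wines_demo)
    ```

    The same two assignments should work on the real wines data frame, provided the columns are named as above.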
    
  • 2020-12-01 06:33

    You can lapply() over a vector of URLs, which you can build by pasting the base URL onto a sequence of page numbers:

    library(rvest)
    
    wines <- lapply(paste0('http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=', 1:39),
                    function(url){
                        url %>% read_html() %>% 
                            html_nodes(".review-listing .title") %>% 
                            html_text()
                    })
    

    The result will be returned in a list with an element for each page.
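
    If you would rather have one vector (or a data frame that remembers the page) than a list, you can flatten it afterwards. A rough sketch, using a small invented list named wines_by_page in place of the real result so it runs without hitting the site:

    ```r
    # Hypothetical stand-in for the list returned by lapply(): one character
    # vector of titles per page (titles invented)
    wines_by_page <- list(
      c("Wine A", "Wine B"),
      c("Wine C"),
      character(0)   # a page that returned no matches
    )

    # Flatten everything into a single character vector
    all_wines <- unlist(wines_by_page)

    # Or keep track of which page each title came from
    wines_df <- data.frame(
      page = rep(seq_along(wines_by_page), lengths(wines_by_page)),
      wine = all_wines,
      stringsAsFactors = FALSE
    )

    print(wines_df)
    ```

    Empty pages simply contribute no rows, so requesting a page past the last one with results does not break the flattening step.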
