Rvest: Scrape multiple URLs

悲哀的现实 2020-12-09 14:22

I am trying to scrape some IMDB data by looping through a list of URLs. Unfortunately, my output isn't what I hoped for, and I haven't managed to store it in a data frame.

3 Answers
  • 2020-12-09 14:35

    Here's one approach using purrr and rvest. The key idea is to save the parsed page, and then extract the bits you're interested in.

    library(rvest)
    library(purrr)
    
    topmovies <- read_html("http://www.imdb.com/chart/top")
    links <- topmovies %>%
      html_nodes(".titleColumn") %>%
      html_nodes("a") %>%
      html_attr("href") %>% 
      xml2::url_absolute("http://imdb.com") %>% 
      .[1:5] # for testing
    
    pages <- links %>% map(read_html)
    
    title <- pages %>% 
      map_chr(. %>% 
        html_nodes("h1") %>% 
        html_text()
      )
    rating <- pages %>% 
      map_dbl(. %>% 
        html_nodes("strong span") %>% 
        html_text() %>% 
        as.numeric()
      )
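
    Since the question also asks for a data frame, here is a minimal sketch of how the two vectors could be combined. It assumes one title and one rating per page. The inline HTML snippets and the "Movie A"/"Movie B" values are hypothetical stand-ins so the example runs without a network connection; in practice `pages` would come from `links %>% map(read_html)`.

    ```r
    library(rvest)
    library(purrr)

    # Hypothetical stand-ins for the downloaded pages (not real IMDB markup).
    pages <- list(
      read_html("<html><body><h1>Movie A (1994)</h1><strong><span>9.2</span></strong></body></html>"),
      read_html("<html><body><h1>Movie B (1972)</h1><strong><span>9.1</span></strong></body></html>")
    )

    # html_node() (singular) returns only the first match, so each page
    # yields exactly one value, as map_chr() and map_dbl() require.
    title <- pages %>%
      map_chr(~ .x %>% html_node("h1") %>% html_text(trim = TRUE))
    rating <- pages %>%
      map_dbl(~ .x %>% html_node("strong span") %>% html_text() %>% as.numeric())

    # One row per page.
    movies <- data.frame(title = title, rating = rating, stringsAsFactors = FALSE)
    movies
    ```

    Using `html_node()` rather than `html_nodes()` is a design choice for this sketch: if a page has several matching elements, `map_chr()` would otherwise fail because it expects a single value per page.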
    
  • 2020-12-09 14:49

    Edit: now with rating as well

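    library(dplyr)
    library(rvest)
    
    # Open a session on the chart page. In rvest >= 1.0.0, html_session() and
    # jump_to() are named session() and session_jump_to().
    s = "http://www.imdb.com/chart/top" %>% html_session
    
    links =
      s %>%
      html_nodes(".titleColumn a") %>%
      html_attr("href") %>%
      tibble(link = .) %>%   # data_frame() is deprecated in favour of tibble()
      slice(1:10) %>%        # first ten movies only
      rowwise %>%            # evaluate mutate() one row (one link) at a time
      mutate(new_page = 
               s %>%
               jump_to(link) %>%
               list,
             title = 
               new_page %>%
               html_nodes("h1") %>% 
               html_text,
             rating = 
               new_page %>%
               html_nodes("strong span") %>% 
               html_text %>%
               as.numeric)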
    
  • 2020-12-09 14:55

    Another approach would be to use sapply as follows:

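    library(rvest)
    
    # Open a session on the chart page (session() in rvest >= 1.0.0).
    s = "http://www.imdb.com/chart/top" %>% html_session
    
    # Helpers: relative links to the movie pages, and the first <h1> of a page.
    title_links <- function(x) {x %>% html_nodes(".titleColumn a") %>% html_attr("href")}
    h1_text <- function(x) {x %>% html_node("h1") %>% html_text(trim=TRUE)}
    
    # Visit each link from the session. sapply() names its result by link,
    # which is what names(.) picks up in the data.frame() call.
    s %>% 
      title_links %>% 
      sapply(. %>% jump_to(s, .) %>% h1_text) %>% 
      data.frame(text = ., link = names(.), row.names=NULL)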
    

    Which results in:

                         text
    1 Die Verurteilten (1994)
    2         Der Pate (1972)
    3       Der Pate 2 (1974)
    4  The Dark Knight (2008)
    5 Schindlers Liste (1993)
                                                                                                                                                     link
    1 /title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1
    2 /title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2
    3 /title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3
    4 /title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4
    5 /title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_5
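
    The links in this output carry long tracking query strings. If only the path or the title ID is needed, they can be trimmed with base R. This is a minimal sketch, assuming the links keep the `/title/tt.../?...` shape shown above; the names `clean_links` and `ids` are introduced here for illustration only.

    ```r
    # Two of the relative links from the output above.
    links <- c(
      "/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1",
      "/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=109KYN8J6HW5TM5Y1P86&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2"
    )

    # Drop everything from the first "?" onwards.
    clean_links <- sub("\\?.*$", "", links)

    # Keep only the title ID (the "tt" followed by digits).
    ids <- sub("^/title/(tt[0-9]+)/.*$", "\\1", links)

    data.frame(link = clean_links, id = ids, stringsAsFactors = FALSE)
    ```

    The trimmed links are still relative, so they can be passed to `jump_to()` in the same way as the original ones.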
    