Scraping pages with inconsistent lengths in dataframe

Submitted by 两盒软妹~` on 2020-02-25 13:14:27

Question


I want to scrape all the names from this page and collect the result in one tibble with three columns. My code only works if all the data is present, hence this error:

 Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 20: Columns `huisarts`, `url`
* Length 21: Column `praktijk`

How can I let my code run and fill the tibble with NAs where the data is missing?

My code for a pausing robot, used later in the scraper function:

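pauzing_robot <- function(periods = c(0, 1)) {
  # Draw a random pause length (in seconds) between the two bounds
  tictoc <- runif(1, periods[1], periods[2])
  # cat() already separates its arguments with a space
  cat(format(Sys.time()), "- Sleeping for", round(tictoc, 2), "seconds\n")
  Sys.sleep(tictoc)
}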

Scraper:

library(tidyverse)
library(rvest)

scrape_page <- function(pagina_nummer) {

  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer)) 

  pauzing_robot(periods = c(0, 1.5))

  tibble(

    huisarts = page %>% 
      html_nodes(".media-heading.title.orange") %>% 
      html_text() %>% 
      str_trim(), 

    praktijk = page %>% 
      html_nodes(".location") %>% 
      html_text() %>%
      str_trim(),

    url = page %>% 
      html_nodes(".media-heading.title.orange") %>% 
      html_nodes("a") %>%
      html_attr("href") %>% 
      str_trim() %>% 
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}

The total number of pages is 445, but for the sake of the example I am only scraping three:

huisartsen <- map_df(sample(1:3), scrape_page)

Page 2 seems to be the one with inconsistent lengths, because this code works:

huisartsen <- map_df(3:4, scrape_page)

A solution in tidyverse code is preferred, if possible. Thanks in advance.


Answer 1:


You need to retrieve the list of parent nodes first, one node per entry on the page:

parents <- page %>% html_nodes("li.media")

Then parse each parent node with the function html_node() (singular) instead of html_nodes():

tibble(
    huisarts = parents %>% 
      html_node(".media-heading.title.orange") %>% 
      html_text() %>% 
      str_trim(), 

    praktijk = parents %>% 
      html_node(".location") %>% 
      html_text() %>%
      str_trim(),

    url = parents %>% 
      html_node(".media-heading.title.orange a") %>% 
      html_attr("href") %>% 
      str_trim() %>% 
      paste0("https://www.zorgkaartnederland.nl", .)
  ) 

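When applied to a set of nodes, html_node() always returns exactly one result per parent, even if that result is just an NA. Every column therefore has the same length as parents, and missing fields show up as NA instead of shortening the column.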
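
The difference between html_nodes() on the whole page and html_node() on each parent can be checked offline. The following is a minimal sketch, not the real site: the HTML snippet, the names in it and the shortened ".title" selector are made up for illustration, and the second entry deliberately has no heading. It also guards the url column, because paste0() would otherwise turn a missing href into the string "https://www.zorgkaartnederland.nlNA".

```r
library(rvest)   # read_html(), html_nodes(), html_node(), html_text(), html_attr(), %>%
library(tibble)

# Made-up listing (an assumption, not the real page): entry 2 has no heading
html <- '
<ul>
  <li class="media">
    <h4 class="title"><a href="/huisarts/a">Dr. A</a></h4>
    <p class="location">Amsterdam</p>
  </li>
  <li class="media">
    <p class="location">Rotterdam</p>
  </li>
  <li class="media">
    <h4 class="title"><a href="/huisarts/c">Dr. C</a></h4>
    <p class="location">Utrecht</p>
  </li>
</ul>'

page <- read_html(html)

# html_nodes() on the whole page only returns the matches that exist,
# so the two vectors end up with different lengths
n_titles    <- length(html_nodes(page, ".title"))     # 2
n_locations <- length(html_nodes(page, ".location"))  # 3

# One parent per entry; html_node() then yields one result per parent
parents <- html_nodes(page, "li.media")
href <- parents %>% html_node(".title a") %>% html_attr("href")

result <- tibble(
  huisarts = parents %>% html_node(".title") %>% html_text(trim = TRUE),
  praktijk = parents %>% html_node(".location") %>% html_text(trim = TRUE),
  # Keep NA as NA instead of pasting it onto the base URL
  url = ifelse(is.na(href), NA_character_,
               paste0("https://www.zorgkaartnederland.nl", href))
)

print(result)
```

The resulting tibble has three rows, with NA in huisarts and url for the second entry. The same ifelse() guard could be applied to the url column in the answer above if some entries on the real page lack a link.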



Source: https://stackoverflow.com/questions/59614978/scraping-pages-with-inconsistent-lengths-in-dataframe
