R: using rvest and purrr:map_df to build a data frame: how to deal with incomplete input [duplicate]

Question

I am scraping web pages with rvest and turning the collected data into a data frame using purrr::map_df. The problem I ran into is that not every page has content for every selector I pass to html_nodes, and map_df silently drops such incomplete pages. I would like map_df to keep those pages and write NA wherever a selector matches nothing. Take the following code:

library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome", 
             "https://es.wikipedia.org/wiki/Curic%C3%B3")
h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  df <- tibble(a, b)
})
out

Here is the output:

> out
# A tibble: 2 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History

The problem here is that the output data frame does not contain rows for pages that have no match for the #History selector (in this case, the third URL). My desired output looks like this:

> out
# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA

Any help will be greatly appreciated!

Answer 1:

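You can do the check inside the map_df call. When a selector matches nothing, html_nodes returns an empty node set and html_text returns character(0), so tibble(a, b) ends up with zero rows and that page contributes nothing to the result. Check the lengths of a and b and substitute NA when they are empty: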

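out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()

  # An empty match has length 0; replace it with a missing value.
  # Note: ifelse() with a length-1 test keeps only the first element of a / b.
  a <- ifelse(length(a) == 0, NA_character_, a)
  b <- ifelse(length(b) == 0, NA_character_, b)

  # One row per page, returned as the value of the block
  tibble(a, b)
})
out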

# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA
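
As an alternative sketch (not part of the original answer): html_node, the singular form, returns exactly one result per page, and html_text gives NA for a missing node, so no length check is needed. The two inline HTML strings below are made-up stand-ins for the real pages so that the example runs without network access; the second one has no #History element.

```r
library(rvest)
library(tidyverse)

# Hypothetical stand-in pages; the second has no element with id "History".
pages <- list(
  '<html><body><h1 id="firstHeading">Rome</h1><span id="History">History</span></body></html>',
  '<html><body><h1 id="firstHeading">Curico</h1></body></html>'
) %>% map(read_html)

out <- pages %>% map_df(~ tibble(
  # html_node() yields a "missing" node when nothing matches,
  # and html_text() turns that into NA instead of character(0).
  a = html_node(.x, "#firstHeading") %>% html_text(),
  b = html_node(.x, "#History") %>% html_text()
))
out
```

Here out has one row per page, with b set to NA for the page that lacks a #History element. Keep in mind that html_node only ever returns the first match, so this suits selectors such as ids that are expected to match at most once.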

Source: https://stackoverflow.com/questions/55961475/r-using-rvest-and-purrrmap-df-to-build-a-data-frame-how-to-deal-with-incomple

Tags

rvest

purrr