Question
I am scraping webpages with rvest and turning the collected data into a data frame using purrr::map_df. The problem I ran into is that not every webpage has content for each selector I pass to html_nodes, and map_df drops such incomplete webpages. I want map_df to include those webpages and write NA wherever html_nodes does not match any content. Take the following code:
library(rvest)
library(tidyverse)
urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome",
             "https://es.wikipedia.org/wiki/Curic%C3%B3")
h <- urls %>% map(read_html)
out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  df <- tibble(a, b)
})
out
Here is the output:
> out
# A tibble: 2 x 2
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 Rome         History
The problem here is that the output data frame does not contain rows for pages that have no match for the #History node (in this case, the third URL). My desired output looks like this:
> out
# A tibble: 3 x 2
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 Rome         History
3 Curicó       NA
Any help will be greatly appreciated!
Answer 1:
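You can handle this inside the map_df call. When a selector matches nothing, html_nodes returns an empty node set and html_text then yields character(0), so tibble(a, b) ends up with zero rows and the page contributes nothing to the result. Check the lengths of a and b and substitute NA when they are zero: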
out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  # No match gives character(0); replace it with NA so the row is kept
  a <- ifelse(length(a) == 0, NA, a)
  b <- ifelse(length(b) == 0, NA, b)
  df <- tibble(a, b)
})
out
# A tibble: 3 x 2
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 Rome         History
3 Curicó       NA
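As a possible alternative when each selector is expected to match at most one element, html_node (singular) combined with html_text should give NA for a missing match, so no length check is needed. The following is only a sketch under that assumption: the inline HTML strings are hypothetical stand-ins for the downloaded pages, so it runs without network access.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-in pages: the second one has no #History element
pages <- list(
  "<html><body><h1 id='firstHeading'>Rome</h1><span id='History'>History</span></body></html>",
  "<html><body><h1 id='firstHeading'>Page without history</h1></body></html>"
) %>% map(read_html)

# html_node() returns a missing node when nothing matches,
# and html_text() turns that missing node into NA
out <- pages %>% map_df(~ tibble(
  a = html_node(.x, "#firstHeading") %>% html_text(),
  b = html_node(.x, "#History") %>% html_text()
))
out
```

Note that html_node keeps only the first match per page, so this variant is not a drop-in replacement if a selector can match several elements.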
Source: https://stackoverflow.com/questions/55961475/r-using-rvest-and-purrrmap-df-to-build-a-data-frame-how-to-deal-with-incomple