Question
I have a tibble with one column that stores a hyperlink in each row. I want to map over these links using map_dfr, passing them one after another through read_html(.x[.x]) %>% html_node(".body-copy-lg") %>% html_text. When I do so, I always end up with the error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=3].
This tells me that read_html is basically saying: "Hey, stop throwing more than one string at me at the same time."
So did I make a mistake in the mapper? Is this a bug? I really can't see why the mapper-function does not grab each element one after another.
What I tried so far:
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
report <- map(".read-more", ~html_nodes(bribe, .x) %>%
html_attr(.x[[1]][[1]][[1]], name = "href"))[[1]] %>%
as_tibble(.name_repair = "unique") %>%
bind_rows() %>%
rename( ...1 = value) %>%
adverts() %>%
map_dfr(~read_html(.x[.x]) %>%
html_node(".body-copy-lg") %>%
html_text)
Do not mind the call to rename(), which was needed to make adverts() usable in this case.
Answer 1:
You're forgetting that most functions in R are vectorized, so using map or apply functions is often unnecessary. In your case, it is only needed in the final step of getting the HTML text, because read_html accepts a single string at a time.
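To illustrate the vectorization point, here is a minimal sketch that runs without network access. The inline HTML and its href values are hypothetical stand-ins for the downloaded page; only html_nodes() and html_attr() come from the code in this thread.

```r
library(rvest)

# Hypothetical inline HTML standing in for the downloaded page
page <- read_html('<html><body>
  <a class="read-more" href="/reports/1">Read more</a>
  <a class="read-more" href="/reports/2">Read more</a>
</body></html>')

# A single call returns the href of every matched node, so no map() is needed
hrefs <- page %>%
  html_nodes(".read-more") %>%
  html_attr("href")
```

Here hrefs is a character vector with one entry per matched link, which is the shape the later steps need.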
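The syntax you are using in map is also puzzling, and I think you should review ?map to get a better handle on it. For instance, you use multiple .x or extracted values where you should just be using .x to refer to the sub-element of the object you are iterating over. Note also that mapping over a tibble iterates over its columns, not its rows, so in your last step .x was the entire column of links, which is why read_html received three strings at once.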
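As a rough sketch of what .x refers to, the snippet below compares mapping over a tibble with mapping over one of its columns. The tibble links and its placeholder strings are hypothetical and stand in for the real URLs; nothing is downloaded.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the tibble of links (placeholder strings, no network access)
links <- tibble(value = c("link-1", "link-2", "link-3"))

# Mapping over the tibble visits each column, so .x is the whole column at once
per_column <- map_int(links, ~ length(.x))

# Mapping over the column visits each element, so .x is a single string
per_element <- map_int(links$value, ~ length(.x))
```

per_column holds a single 3, matching the extent=3 in the error message, while per_element holds a 1 for each link. Iterating over the column rather than the tibble is what hands read_html one string per call.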
library(tidyverse)
library(rvest)
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
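# html_nodes() and html_attr() are vectorized, so no map() is needed to collect the links
report <- html_nodes(bribe, ".read-more") %>%
  html_attr("href") %>%
  as_tibble(.name_repair = "unique") %>%
  # drop advert links that match target_regex
  filter(str_detect(value, target_regex, negate = TRUE)) %>%
  # read_html() takes one URL at a time, so iterate over the column here
  mutate(text = map_chr(value, ~read_html(.x) %>%
    html_node(".body-copy-lg") %>%
    html_text()))
report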
# A tibble: 3 x 2
value text
<chr> <chr>
1 http://ipaidabribe.com/reports/paid/paid-bribe-to-settle-matter… "\r\n Place: Nelamangala Police Station, Bangalore\nDate of incident: 5th Jan 2020, 3PM…
2 http://ipaidabribe.com/reports/paid/paid-500-rs-bribe-at-nizamu… "\r\n My Brother Mahesh Prasad travelling on PNR number 4822171124 train no 12721 Ni…
3 http://ipaidabribe.com/reports/paid/drone-air-follow-focus-wire… "\r\n This new Silencer Air+ is a tremendously versatile and resourceful follow focus, z…
Source: https://stackoverflow.com/questions/59789055/map-a-tbl-of-hyperlinks-into-read-html