Map a tbl of hyperlinks into read_html

跟風遠走, posted on 2020-01-25 06:54:26

Question


I have a tibble with a single column that stores one hyperlink per row. I want to map over these links using map_dfr, passing them one after another through read_html(.x[.x]) %>% html_node(".body-copy-lg") %>% html_text. When I do so, I always end up with this error:

Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=3].

This tells me that read_html is basically saying: "Hey, stop throwing more than one string at me at the same time."

So did I make a mistake in the mapper? Is this a bug? I really can't see why the mapper-function does not grab each element one after another.

What I have tried so far:

target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)| 
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"

adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]

bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))

report <- map(".read-more", ~html_nodes(bribe, .x) %>% 
    html_attr(.x[[1]][[1]][[1]], name = "href"))[[1]] %>% 
    as_tibble(.name_repair = "unique") %>% 
    bind_rows() %>% 
    rename( ...1 = value) %>% 
    adverts() %>%
    map_dfr(~read_html(.x[.x]) %>%  
    html_node(".body-copy-lg") %>% 
    html_text)

Never mind the call to rename(); it was only needed to make the adverts() function usable in this case.


Answer 1:


You're forgetting that most functions in R are vectorized, so map or apply functions are often unnecessary. In your case, map is only needed in the final step, where each link is read and its HTML text is extracted.
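As a minimal illustration of that vectorization, the sketch below parses a small inline HTML snippet instead of the live site. The snippet and its example.com links are made up for this example. html_attr() returns one href per matched node in a single call, with no map() involved.

```r
library(rvest)

# Hypothetical inline page standing in for the real listing page.
page <- read_html('
  <html><body>
    <a class="read-more" href="http://example.com/report-1">Read more</a>
    <a class="read-more" href="http://example.com/report-2">Read more</a>
  </body></html>')

# html_nodes() returns every matching node; html_attr() is vectorized over them.
links <- page %>%
  html_nodes(".read-more") %>%
  html_attr("href")

print(links)
```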

The syntax you are using in map is also puzzling, and I think you should review ?map to get a better handle on it. For instance, you use multiple .x or extracted values where you should just be using .x to refer to the element of the object you are iterating over.
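A likely reason for the "extent=3" in the error is that a tibble is a list of columns, so mapping over the tibble itself hands the whole column to the function at once. The sketch below uses a made-up three-row tibble with placeholder example.com links to show the difference between mapping over the tibble and mapping over its column.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for the tibble of links (placeholder URLs).
links <- tibble(value = c("http://example.com/a",
                          "http://example.com/b",
                          "http://example.com/c"))

# Mapping over the tibble iterates over its columns:
# .x is the entire `value` column, a character vector of length 3.
lengths_by_column <- map_int(links, length)

# Mapping over the column iterates over its elements:
# .x is a single link each time.
lengths_by_element <- map_int(links$value, length)

print(lengths_by_column)
print(lengths_by_element)
```

This is why the code in this answer maps over the value column inside mutate() rather than piping the whole tibble into a map function.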

library(tidyverse)
library(rvest)

target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)| 
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"

adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]

bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))

report <- html_nodes(bribe, ".read-more") %>% 
  html_attr("href") %>% 
  as_tibble(.name_repair = "unique") %>% 
  filter(str_detect(value, target_regex, negate = TRUE)) %>% 
  mutate(text = map_chr(value, ~read_html(.x) %>%  
                          html_node(".body-copy-lg") %>% 
                          html_text))

report

# A tibble: 3 x 2
  value                                                            text                                                                                                        
  <chr>                                                            <chr>                                                                                                       
1 http://ipaidabribe.com/reports/paid/paid-bribe-to-settle-matter… "\r\n                    Place: Nelamangala Police Station, Bangalore\nDate of incident:  5th Jan 2020, 3PM…
2 http://ipaidabribe.com/reports/paid/paid-500-rs-bribe-at-nizamu… "\r\n                        My Brother Mahesh Prasad travelling on PNR number 4822171124 train no 12721 Ni…
3 http://ipaidabribe.com/reports/paid/drone-air-follow-focus-wire… "\r\n                    This new Silencer Air+ is a tremendously versatile and resourceful follow focus, z…
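The text column above still carries each page's leading whitespace. One possible tidy-up, sketched here on made-up inline HTML rather than the live site, is to wrap the extraction in a small helper and pass trim = TRUE to html_text(). The helper name get_report_text and the sample pages are hypothetical.

```r
library(rvest)
library(purrr)

# Hypothetical helper: read one page (URL or inline HTML) and return
# the trimmed text of its first ".body-copy-lg" node.
get_report_text <- function(src) {
  read_html(src) %>%
    html_node(".body-copy-lg") %>%
    html_text(trim = TRUE)
}

# Made-up inline pages standing in for the report pages.
pages <- c(
  '<html><body><div class="body-copy-lg">\n   First report   </div></body></html>',
  '<html><body><div class="body-copy-lg">\n   Second report </div></body></html>'
)

# map_chr() calls the helper once per element and returns a character vector.
texts <- map_chr(pages, get_report_text)

print(texts)
```

With the real data, the same helper could be dropped into the pipeline as mutate(text = map_chr(value, get_report_text)).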


Source: https://stackoverflow.com/questions/59789055/map-a-tbl-of-hyperlinks-into-read-html
