rvest vs RSelenium results for text extracting

Posted by 梦想与她 on 2019-12-06 15:04:28

Question


So far I have been using RSelenium to extract the text of a homepage, but I would like to switch to a faster solution such as rvest.

library(rvest)
url = 'https://www.r-bloggers.com'
rvestResults <- read_html(url) %>%
  html_node('body') %>%
  html_text()

library(RSelenium)
# Assumes remDr is an already-open remoteDriver client,
# e.g. remDr <- rsDriver()$client
remDr$navigate(url)
rSelResults <- remDr$findElement(
  using = "xpath",
  value = "//body"
)$getElementText()

Comparing the results below shows that the rvest output includes some JavaScript code, while the RSelenium output is much "cleaner".

I am aware of the differences between rvest and RSelenium: RSelenium drives a headless browser, while rvest just reads the "plain" source of the homepage.

My question is: is there a way to get the RSelenium output below with rvest, or with a third approach that is as fast as (or faster than) rvest?

Rvest results:

> substring(rvestResults, 1, 500)
[1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!
     \t\t\t\r\nfunction init() {\r\nvar vidDefer = document.getElementsByTagName('iframe');\r\nfor (var i=0; i<vidDefer.length; i++) {\r\nif(vidDefer[i].getAttribute('data-src')) 
     {\r\nvidDefer[i].setAttribute('src',vidDefer[i].getAttribute('data-src'));\r\n} } }\r\nwindow.onload = i"

RSelenium results:

> substring(rSelResults, 1, 500)
[1] "R news and tutorials contributed by (750) R bloggers\nHome\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\n�\n�\n�\nContact us\nWELCOME!\nHere you will find daily news and tutorials about R, 
     contributed by over 750 bloggers.\nThere are many ways to follow us -\nBy e-mail:\nOn Facebook:\nIf you are an R blogger yourself you are invited to add your own R content feed to this site (Non-English 
     R bloggers should add themselves- here)\nJOBS FOR R-USERS\nData/GIS Analyst for Ecoscape Environmental Consultants @ Kelowna, "

Answer 1:


Maybe webdriver, an R client for PhantomJS, would do a better job (I can't test it against RSelenium at the moment):

library("webdriver")
library("rvest")

pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
url <- 'https://www.r-bloggers.com'
ses$go(url)

res <- ses$getSource() %>% 
  read_html() %>%
  html_node('body') %>%
  html_text()

substring(res, 1, 500)
#> [1] "\n\n\n\t\t    \t    \t\n        \n        R news and tutorials contributed by (750) R bloggers         \n    Home\nAbout\nRSS\nadd your blog!\nLearn R\nR jobs\nSubmit a new job (it’s free)\n\tBrowse latest jobs (also free)\n\nContact us\n\n\n\n\n\n\n\n    \n\t\tWelcome!\t\t\t\n\n\n\n\nHere you will find daily news and tutorials about R, contributed by over 750 bloggers. \n\nThere are many ways to follow us - \nBy e-mail:\n\n\n<img src=\"https://feeds.feedburner.com/~fc/RBloggers?bg=99CCFF&amp;fg=444444&amp;anim=0\" height=\"26\" width=\"88\" sty"



Answer 2:


You can try regular expressions to clean up the extracted text:

library(rvest)

url <- "https://www.r-bloggers.com"

res <- url %>% 
  read_html() %>% 
  html_nodes('body') %>%
  html_text()

library(stringr)

# clean up text data
res %>%
  str_replace_all(pattern = "\n", replacement = " ") %>%
  str_replace_all(pattern = "[\\^]", replacement = " ") %>%
  str_replace_all(pattern = "\"", replacement = " ") %>%
  str_replace_all(pattern = "\\s+", replacement = " ") %>%
  str_trim(side = "both")
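
The regex cleanup above normalizes whitespace, but it does not remove the JavaScript that html_text() picks up from script elements. As a rough sketch of one possible alternative (not part of the original answers), the text nodes that sit inside script, style, or noscript elements can be skipped with an XPath filter before the text is collected. The inline HTML string here is a hypothetical stand-in for the real page, so the example runs without network access:

```r
library(rvest)

# Hypothetical inline page standing in for the real site (an assumption
# for illustration; replace with read_html(url) for a live page)
html <- '<html><body>
  <h1>Welcome!</h1>
  <script>function init() { var x = 1; }</script>
  <style>h1 { color: red; }</style>
  <p>Here you will   find daily news.</p>
</body></html>'

page <- read_html(html)

# Select every text node under <body> that is NOT inside an element
# whose content a browser would not display as page text
texts <- page %>%
  html_nodes(xpath = paste0(
    "//body//text()",
    "[not(ancestor::script) and not(ancestor::style) and not(ancestor::noscript)]"
  )) %>%
  html_text()

# Collapse runs of whitespace, trim, and drop empty fragments
texts <- trimws(gsub("\\s+", " ", texts))
clean <- paste(texts[nzchar(texts)], collapse = "\n")

cat(clean, "\n")
```

This only filters what is already in the static HTML; content that the page builds with JavaScript at load time would still require a browser-based tool such as RSelenium or webdriver. Joining text nodes with newlines also splits text at inline elements (links, bold text), so the result will not match a browser's rendered text exactly.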


Source: https://stackoverflow.com/questions/56857535/rvest-vs-rselenium-results-for-text-extracting
