R: Using rvest package instead of XML package to get links from URL

谁说胖子不能爱 提交于 2019-12-02 21:01:01

Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first since the site has the content-type set to text/plain for that file and that tosses rvest into a tizzy.

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"  
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"  
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"  
## [275] "/inf_corporativa98959_ZNC.html"  

That further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")

I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function so we can collect the values as they are read, saving on memory and increasing efficiency.

links <- function(URL) 
{
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                links <<- c(links, xmlGetAttr(node, "href"))
                node
             },
             links = function() links)
        }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}

links("http://www.bvl.com.pe/includes/empresas_todas.dat")
#  [1] "/inf_corporativa71050_JAIME1CP1A.html"
#  [2] "/inf_corporativa10400_INTEGRC1.html"  
#  [3] "/inf_corporativa66100_ACESEGC1.html"  
#  [4] "/inf_corporativa71300_ADCOMEC1.html"  
#  [5] "/inf_corporativa10250_HABITAC1.html"  
#  [6] "/inf_corporativa77900_PARAMOC1.html"  
#  [7] "/inf_corporativa77935_PUCALAC1.html"  
#  [8] "/inf_corporativa77600_LAREDOC1.html"  
#  [9] "/inf_corporativa21000_AIBC1.html"     
#  ...
#  ...
# Option 1
library(RCurl)
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
html("http://www.bvl.com.pe/includes/empresas_todas.dat")%>>% html_nodes("a") %>>% html_attr("href")

Richard's answer works for HTTP pages but not the HTTPS page I needed (Wikipedia). I substituted RCurl's getURL function as below:

library(RCurl)

links <- function(URL) 
{
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node
    },
    links = function() links)
  }
  h1 <- getLinks()
  xData <- getURL(URL)
   htmlTreeParse(xData, handlers = h1)
  h1$links()
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!