Extract Links from Webpage using R

Asked by 庸人自扰 on 2020-12-23 02:07

The two posts below are great examples of different approaches to extracting data from websites and parsing it into R.

Scraping html tables into R data frames using …

3 Answers
  • 2020-12-23 02:45

    You might try

    library(rvest)

    htmlcode <- read_html("URL")
    nodes <- html_nodes(htmlcode, xpath = '//*[contains(@href, "SEARCHTERM")]') %>%
      html_attr("href")
    df <- data.frame(link = as.character(nodes))
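    Scraped hrefs are often relative paths like "/about". If you need full URLs, xml2's url_absolute() can resolve them against the page's base URL — a small sketch with a hypothetical base URL:

    ```r
    library(xml2)

    # Hypothetical relative link and base URL, for illustration only
    base_url <- "http://stackoverflow.com"
    url_absolute("/about", base_url)
    # [1] "http://stackoverflow.com/about"
    ```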
    
  • 2020-12-23 02:50

    The documentation for htmlTreeParse shows one method. Here's another:

    > library(XML)
    > url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
    > doc <- htmlParse(url)
    > links <- xpathSApply(doc, "//a/@href")
    > free(doc)
    

    (You can drop the "href" attribute from the returned links by passing "links" through "as.vector".)
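    To see that behavior without hitting the network, you can parse a literal HTML string (asText = TRUE): the XPath result is a character vector whose names are all "href", and as.vector() strips them:

    ```r
    library(XML)

    # Parse an inline HTML fragment instead of fetching a URL
    doc <- htmlParse('<a href="/a">A</a><a href="/b">B</a>', asText = TRUE)
    links <- xpathSApply(doc, "//a/@href")
    names(links)      # "href" "href"
    as.vector(links)  # "/a" "/b"
    free(doc)
    ```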

    My previous reply:

    One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

    > url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
    > html <- paste(readLines(url), collapse="\n")
    > library(stringr)
    > matched <- str_match_all(html, "<a href=\"(.*?)\"")
    

    (I guess some people might not approve of using regexp's here.)

    matched is a list of matrices, one per input string in the vector html -- since html has length one here, matched has just one element. The matches for the first capture group are in column 2 of this matrix (in general, the ith group appears in column i + 1).

    > links <- matched[[1]][, 2]
    > head(links)
    [1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
    [2] "http://careers.stackoverflow.com"                                                  
    [3] "http://meta.stackoverflow.com"                                                     
    [4] "/about"                                                                            
    [5] "/faq"                                                                              
    [6] "/"
    
  • 2020-12-23 03:10

    Even easier with rvest:

    library(xml2)
    library(rvest)
    
    URL <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
    
    pg <- read_html(URL)
    
    head(html_attr(html_nodes(pg, "a"), "href"))
    
    ## [1] "//stackoverflow.com"                                                                                                                                          
    ## [2] "http://chat.stackoverflow.com"                                                                                                                                
    ## [3] "//stackoverflow.com"                                                                                                                                          
    ## [4] "http://meta.stackoverflow.com"                                                                                                                                
    ## [5] "//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider"                                                       
    ## [6] "https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=http%3a%2f%2fstackoverflow.com%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
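    The same pattern also works offline against an HTML fragment; note that in current rvest (1.0+), html_nodes() has been superseded by html_elements(). A sketch using rvest's minimal_html() helper:

    ```r
    library(rvest)

    # Build a small in-memory document rather than fetching a live page
    pg <- minimal_html('<a href="/one">one</a> <a href="/two">two</a>')
    html_attr(html_elements(pg, "a"), "href")
    # [1] "/one" "/two"
    ```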
    