The two posts below are great examples of different approaches to extracting data from websites and parsing it into R.
Scraping html tables into R data frames usin
The documentation for htmlTreeParse shows one method. Here's another:
> library(XML)
> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
(The result is a named character vector in which every element is named "href". You can drop those names by passing "links" through "as.vector".)
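To see what as.vector does to the result, here is a minimal sketch that runs on a small inline HTML string instead of a live page. The HTML fragment and the variable name plain_links are made up for illustration, and it assumes the XML package is installed.

```r
library(XML)

# Hypothetical HTML fragment, used in place of a live page.
html <- '<html><body><a href="/about">About</a> <a href="http://example.com">Example</a></body></html>'

# asText = TRUE makes htmlParse treat the string as content, not as a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Each match is an attribute node, so the result is a character vector
# whose elements are all named "href".
links <- xpathSApply(doc, "//a/@href")
free(doc)

# as.vector() strips the names and leaves the bare link targets.
plain_links <- as.vector(links)
print(plain_links)
```

unname(links) would work here as well; as.vector is simply what the note above suggests.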
My previous reply:
One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).
> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")
> library(stringr)
> matched <- str_match_all(html, "
(I guess some people might not approve of using regular expressions to parse HTML here.)
matched is a list of matrices, one per input string in the vector html. Since html has length one here, matched has just one element. Column 1 of this matrix holds the full match, and the matches for the first capture group are in column 2 (in general, the ith group appears in column i + 1).
> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"
[3] "http://meta.stackoverflow.com"
[4] "/about"
[5] "/faq"
[6] "/"
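The column layout is easier to see on a tiny input. The sketch below uses a made-up HTML string and an assumed pattern with a single capture group. The pattern is only illustrative and is not necessarily the one used above; it will miss links whose href is not the first attribute or is single-quoted.

```r
library(stringr)

# Hypothetical HTML fragment with two links.
html <- '<p><a href="/about">About</a> and <a href="http://example.com">Example</a></p>'

# Assumed pattern: one non-greedy capture group holding the href value.
pattern <- '<a href="(.*?)"'

matched <- str_match_all(html, pattern)

# One list element per input string; html has length 1, so matched has one element.
m <- matched[[1]]

# Column 1 holds the full match, column 2 the first capture group.
print(m)

links <- m[, 2]
print(links)
```

With more capture groups in the pattern, the matrix would simply gain one extra column per group.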