I have a list of URLs that I would like to parse and normalize.
I'd like to be able to split each address into parts so that I can identify, for example, "www.google.com/test/index.asp" and "google.com/somethingelse" as both belonging to google.com.
I'd forgo a package and use regex for this.
EDIT: reformulated after the robot attack from Dason...
x <- c("talkstats.com", "www.google.com/test/index.asp",
"google.com/somethingelse", "www.stackoverflow.com",
"http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")

## Strip the scheme, keep everything before the first slash, then drop "www."
parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
parser(x)

## Group the original URLs by their parsed domain
lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
names(lst) <- unique(parser(x))
lst
## $talkstats.com
## [1] "talkstats.com"
##
## $google.com
## [1] "www.google.com/test/index.asp" "google.com/somethingelse"
##
## $stackoverflow.com
## [1] "www.stackoverflow.com"
##
## $bing.com
## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="
This may need to be extended depending on the structure of the data.
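For instance, here is a minimal sketch of one possible extension, assuming the data might also contain "https://" schemes or port numbers (neither occurs in x above, and parser2 is just a hypothetical name):

## Hypothetical extension: also handle https:// and trailing port numbers
parser2 <- function(x) {
    x <- gsub("^https?://", "", x)          # strip http:// or https://
    x <- gsub("^www\\.", "", x)             # strip a leading www.
    x <- sapply(strsplit(x, "/"), "[[", 1)  # keep everything before the first slash
    gsub(":\\d+$", "", x)                   # drop a trailing port such as :8080
}

parser2("https://www.example.com:8080/path")
## [1] "example.com"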