Does R have any package for parsing out the parts of a URL?

Asked by 无人及你 on 2020-12-30 06:32 · 6 answers · 730 views

I have a list of urls that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify "www.google.com/test/in…

6 Answers
  •  时光取名叫无心
    2020-12-30 06:56

    I'd forgo a package and use regex for this.

    EDIT reformulated after the robot attack from Dason...

    x <- c("talkstats.com", "www.google.com/test/index.asp", 
        "google.com/somethingelse", "www.stackoverflow.com",
        "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")
    
    # Strip the scheme, keep everything before the first "/", then drop a leading "www."
    parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
    parser(x)
    
    # Group the original URLs by their extracted domain
    lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
    names(lst) <- unique(parser(x))
    lst
    
    ## $talkstats.com
    ## [1] "talkstats.com"
    ## 
    ## $google.com
    ## [1] "www.google.com/test/index.asp" "google.com/somethingelse"     
    ## 
    ## $stackoverflow.com
    ## [1] "www.stackoverflow.com"
    ## 
    ## $bing.com
    ## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="
    

    This may need to be extended depending on the structure of the data.
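If you also need the other components (scheme, path, query string), a single regex with capture groups can pull them apart in base R. This is a minimal sketch, not a full URL parser; the function name `split_url` and the regex are my own, and dedicated packages such as `httr` (`parse_url()`) and `urltools` (`url_parse()`) handle more edge cases:

```r
# Split URLs into scheme, host, path and query using one regex.
# Capture groups: 1 = scheme, 2 = host, 3 = path, 4 = query string.
split_url <- function(urls) {
  re <- "^(?:(https?)://)?([^/?]+)(/[^?]*)?(?:\\?(.*))?$"
  m <- regmatches(urls, regexec(re, urls, perl = TRUE))
  out <- do.call(rbind, lapply(m, function(g) g[2:5]))
  colnames(out) <- c("scheme", "host", "path", "query")
  data.frame(out, stringsAsFactors = FALSE)
}

split_url(c("talkstats.com",
            "www.google.com/test/index.asp",
            "http://www.bing.com/search?q=google.com&go="))
```

Unmatched optional groups come back as empty strings, so schemeless URLs like `talkstats.com` still land in the `host` column.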
