Remove URLs from string

I have a vector of strings—myStrings—in R that look something like:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where another url is a valid http url but stackoverflow will not let me insert more than one url thats why i'm writing another url instead. I want to remove all the urls from myStrings to look like:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I've tried many functions in the stringr package but nothing works.

You can use gsub with a regular expression to match URLs,

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"

Update: It would be best if you could post a few different URLs so we know what we're working with. But I think this regular expression will work for the URLs you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

The above expression explained:

? optional space
(f|ht) match "f" or "ht"
tp match "tp"
(s?) optionally match "s" if it's there
(://) match "://"
(.*) match every character (everything) up to
[.|/] a period or a forward-slash
(.*) then everything after that

I'm not an expert with regular expressions, but I think I explained that correctly.

Note: url shorteners are no longer allowed in SO answers, so I was forced to remove a section while making my most recent edit. See edit history for that part.

I've been working on a canned group of regular expressions for common tasks like this that I've thrown into a package, qdapRegex, on github that will eventually go to CRAN. It can also extract the pieces as well as sub them out. Feedback on the package for any taking a look is welcomed.

Here it is:

library (devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

Edit I saw that twitter links were not removed. I will not be adding this to the regex specific to the rm_url function but have added it to the dictionary in qdapRegex. So there's no specific function to remove standard urls and twitter both but the pastex (paste regular expression) allows you to easily grab regexes from the dictionary and past them together (using the pipe operator, |). Since all rm_XXX style functions work essentially the same you can pass the pastex output to the pattern argument of any rm_XXX function or create your own function as I show below:

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)

 str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*',"", str1)
 #[1] "download file from "                         
 #[2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
 #[1] "download file from"                          
 #[2] "this is the link to my website for more info"

Update

In order to match ftp, I would use the same idea from @Richard Scriven's post

  str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")


  gsub('(f|ht)tp\\S+\\s*',"", str1)
  #[1] "download file from "                         
  #[2] "this is the link to my website for more info"
  #[3] "this link to gives more info"

Maurício Collaça

Some previous answers remove beyond the end of the URL and the "\b" extension would help. It could cover also the "sftp://" urls.

For regular urls:

gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny urls:

gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)

来源：https://stackoverflow.com/questions/25352448/remove-urls-from-string

标签

string

stringr