Question
I want to use a regex to extract all URLs from text in a dataframe, into a new column. I have some older code that I have used to extract keywords, so I'm looking to adapt the code for a regex. I want to save a regex as a string variable and apply here:
data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))
It seems that fixed=FALSE should tell grepl that it's a regular expression, but R doesn't like how I am trying to save the regex as:
regex <- "http.*?1-\\d+,\\d+"
My data is organized in a data frame like this:
data <- read.table(text='"Content" "date"
1 "a house a home https://www.foo.com" "12/31/2013"
2 "cabin ideas https://www.example.com in the woods" "5/4/2013"
3 "motel is a hotel" "1/4/2013"', header=TRUE)
And would hopefully look like:
Content date ContentURL
1 a house a home https://www.foo.com 12/31/2013 https://www.foo.com
2 cabin ideas https://www.example.com in the woods 5/4/2013 https://www.example.com
3 motel is a hotel 1/4/2013
Answer 1:
Hadleyverse solution (stringr package) with a decent URL pattern:
library(stringr)
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
data$ContentURL <- str_extract(data$Content, url_pattern)
data
## Content date ContentURL
## 1 a house a home https://www.foo.com 12/31/2013 https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods 5/4/2013 https://www.example.com
## 3 motel is a hotel 1/4/2013 <NA>
You can use str_extract_all if there are multiple URLs in Content, but that will involve some extra processing on your end afterwards.
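For example, a possible sketch that collapses every match per row into one comma-separated string (rows with no URL come out as an empty string):
all_urls <- str_extract_all(data$Content, url_pattern)      # one character vector of matches per row
data$ContentURL <- sapply(all_urls, paste, collapse = ",")  # collapse each vector into a single string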
Answer 2:
Here's one approach using the qdapRegex library:
library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data
## Content date url
## 1 a house a home https://www.foo.com 12/31/2013 https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods 5/4/2013 https://www.example.com
## 3 motel is a hotel 1/4/2013 <NA>
To see the regular expression used by the function (as qdapRegex aims to help analyze and educate about regexes), you can use the grab function with the function name prefixed with @:
grab("@rm_url")
## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"
grepl gives you a logical output: yes, this string contains a match, or no, it does not. grep gives you the indices or, with value = TRUE, the values, but those values are the whole strings, not the substrings you want.
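A quick illustration of that difference on the example data (results noted in the comments):
grepl("http", data$Content)               # TRUE TRUE FALSE: does each row contain a match?
grep("http", data$Content)                # 1 2: which rows match?
grep("http", data$Content, value = TRUE)  # the full matching strings, not just the URLs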
So to pass this regex along to base R or the stringi package (qdapRegex wraps stringi for extraction) you could do:
regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))
library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))
I'm sure there's a stringr approach too but am not familiar with the package.
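For completeness, one possible stringr sketch reusing the pattern from grab (this takes the first match per row; str_extract_all would return every match as a list):
library(stringr)
str_extract(data[["Content"]], grab("@rm_url"))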
Answer 3:
Split on space then find "http":
data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i) {
                                   x <- i[grepl("http", i)]
                                   if (length(x) == 0) x <- NA
                                   x
                                 }))
data
# Content date ContentURL
# 1 a house a home https://www.foo.com 12/31/2013 https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods 5/4/2013 https://www.example.com
# 3 motel is a hotel 1/4/2013 <NA>
Answer 4:
You can use the unglue package:
library(unglue)
unglue_unnest(data, Content, "{=.*?}{url=http[^ ]*}{=.*?}", remove = FALSE)
#> Content date url
#> 1 a house a home https://www.f00.com 12/31/2013 https://www.f00.com
#> 2 cabin ideas https://www.example.com in the woods 5/4/2013 https://www.example.com
#> 3 motel is a hotel 1/4/2013 <NA>
- {=.*?} matches anything and is not assigned to an extracted column, so the lhs of = is empty
- {url=http[^ ]*} matches something that starts with http and is followed by non-spaces; as the lhs is url, it is extracted into the url column
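To try the pattern on its own before touching the data frame, a minimal sketch with unglue_data (assuming it is applied to a plain character vector; strings with no http yield NA):
library(unglue)
# returns a one-column data frame named url, with NA for the second string
unglue_data(c("cabin ideas https://www.example.com in the woods", "motel is a hotel"),
            "{=.*?}{url=http[^ ]*}{=.*?}")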
PS: I manually changed foo into f00 in my answer because of SO restrictions.
Source: https://stackoverflow.com/questions/26496538/extract-urls-with-regex-into-a-new-data-frame-column