Extract URLs with regex into a new data frame column

Submitted by 那年仲夏 on 2019-11-29 15:22:17

Question


I want to use a regex to extract all URLs from text in a data frame into a new column. I have some older code that I used to extract keywords, so I'm looking to adapt it for a regex. I want to save the regex as a string variable and apply it here:

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that it's a regular expression, but R doesn't like how I am trying to save the regex:

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

And would hopefully look like:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013                        

Answer 1:


Hadleyverse solution (stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

You can use str_extract_all if there are multiples in Content, but that will involve some extra processing on your end afterwards.
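For example, a minimal sketch of the multi-URL case (my assumption here is that you want all matches collapsed into one comma-separated string per row, echoing the paste(..., collapse=',') idea from the question):

# str_extract_all() returns a list with one character vector of matches per row;
# collapse each vector into a single comma-separated string ("" when there is no URL)
data$ContentURL <- vapply(
  str_extract_all(data$Content, url_pattern),
  function(urls) paste(urls, collapse = ","),
  character(1)
)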




Answer 2:


Here's one approach using the qdapRegex library:

library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data

##                                            Content       date                     url
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

To see the regular expression used by the function (qdapRegex aims to help analyze and teach regexes), you can use the grab function with the function name prefixed with @:

grab("@rm_url")

## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

grepl returns a logical vector saying whether each string contains a match or not. grep returns the indices (or, with value = TRUE, the values), but those values are the whole strings, not the substrings you want.
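For instance, on the example data, using the pattern from grab("@rm_url"):

grepl(grab("@rm_url"), data[["Content"]], perl = TRUE)
## [1]  TRUE  TRUE FALSE

grep(grab("@rm_url"), data[["Content"]], perl = TRUE)
## [1] 1 2

grep(grab("@rm_url"), data[["Content"]], perl = TRUE, value = TRUE)
## returns the two full matching strings, not just the URLs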

So, to pass this regex along to base R or the stringi package (qdapRegex wraps stringi for extraction), you could do:

regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))

library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))

I'm sure there's a stringr approach too but am not familiar with the package.
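A stringr equivalent would presumably be str_extract() with the same pattern (a sketch only; stringr wraps stringi, so this should mirror the stri_extract() call above):

library(stringr)
str_extract(data[["Content"]], grab("@rm_url"))
## [1] "https://www.foo.com"     "https://www.example.com" NA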




Answer 3:


Split on space then find "http":

data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i){
                                   # keep only the tokens that contain "http"
                                   x <- i[ grepl("http", i)]
                                   # rows with no URL get NA so unlist() keeps one value per row
                                   if(length(x) == 0) x <- NA
                                   x
                                 }))


data
#                                            Content       date              ContentURL
# 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
# 3                                 motel is a hotel   1/4/2013                    <NA>



Answer 4:


You can use the unglue package:

library(unglue)
unglue_unnest(data,Content, "{=.*?}{url=http[^ ]*}{=.*?}",remove = FALSE)
#>                                            Content       date                       url
#> 1               a house a home https://www.f00.com 12/31/2013     https://www.f00.com
#> 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
#> 3                                 motel is a hotel   1/4/2013                    <NA>
  • {=.*?} matches anything, but because the lhs of = is empty it is not extracted into a column
  • {url=http[^ ]*} matches something that starts with http followed by non-spaces; because the lhs is url, the match is extracted into the url column

PS: I manually changed foo to f00 in my answer because of SO restrictions.



Source: https://stackoverflow.com/questions/26496538/extract-urls-with-regex-into-a-new-data-frame-column
