R gsub to extract emails from text

Question

I have a variable a, created by calling readLines on a file that contains some emails. I have already kept only the rows containing the @ symbol, and am now struggling to extract the emails themselves. The text in my variable looks like this:

> dput(a[1:5])
c("buenas tardes. excelente. por favor a: Saolonm@hotmail.com", 
"26.leonard@gmail.com ", "Aprecio tu aporte , mi correo es jcdavola31@gmail.com , Muchas Gracias", 
"gracias andrescarnederes@headset.cl", "Me apunto, muchas gracias mi dirección luciana.chavela.ecuador@gmail.com me será de mucha utilidad. "
)

I got a starting point for extracting the emails from another question on SO (@Aaron Haurun's answer). Slightly modified (I added a [\w.] before the @ to handle emails with a . between names), it worked well on regex101.com. However, it fails when I port it to gsub:

> gsub("()(\\w[\\w.]+@[\\w.-]+|\\{(?:\\w+, *)+\\w+\\}@[\\w.-]+)()", 
       "\\2", 
       a[1:5], 
       perl = FALSE) ## It doesn't matter if I use perl = TRUE

[1] "buenas tardes. excelente. por favor a: Saolonm@hotmail.com"           "26.leonard@gmail.com "                                                                          
[3] "Aprecio tu aporte , mi correo es jcdavola31@gmail.com , Muchas Gracias"                           "gracias andrescarnederes@headset.cl"                                                                       
[5] "Me apunto, muchas gracias mi dirección luciana.chavela.ecuador@gmail.com me será de mucha utilidad. "

What am I doing wrong and how can I grab those emails? Thanks!

Answer 1:

We can use str_extract() from the stringr package:

library(stringr)
str_extract(a, "\\S*@\\S*")

[1] "Saolonm@hotmail.com"              
[2] "26.leonard@gmail.com"             
[3] "jcdavola31@gmail.com"             
[4] "andrescarnederes@headset.cl"      
[5] "luciana.chavela.ecuador@gmail.com"

where \\S* matches any number (zero or more) of non-whitespace characters.
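
Because \\S* accepts any non-whitespace character, it also picks up punctuation that touches the address, and str_extract() returns only the first match per string. The following is a minimal sketch of both effects, assuming a hypothetical input x that is not part of the question's data. The narrower character class is an illustrative assumption, not a complete email grammar.

```r
library(stringr)

# Hypothetical input: two addresses in one string, the first followed by a comma
x <- "write to ana@example.com, or to bob@example.org"

# str_extract() keeps only the first match, and \\S* swallows the trailing comma
first_loose <- str_extract(x, "\\S*@\\S*")
first_loose
# "ana@example.com,"

# str_extract_all() returns every match as a list with one element per input string;
# [\\w.-] (word characters, dot, hyphen) is an assumed, simplified character class
all_tight <- str_extract_all(x, "[\\w.-]+@[\\w.-]+")[[1]]
all_tight
# "ana@example.com" "bob@example.org"
```

A trailing period directly after an address would still be captured by this class, so some clean-up may be needed depending on the data.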

Answer 2:

Adapting the regex from the answer you referred to in your question:

library(stringr)
str_extract(a, '\\S+@\\S+|\\{(?:\\w+, *)+\\w+\\}@[\\w.-]+')
#[1] "Saolonm@hotmail.com"               "26.leonard@gmail.com"              "jcdavola31@gmail.com"              "andrescarnederes@headset.cl"      
#[5] "luciana.chavela.ecuador@gmail.com"

Answer 3:

We can also do this in base R:

unlist(regmatches(a, gregexpr("\\S+@\\S+", a)))
#[1] "Saolonm@hotmail.com"
#[2] "26.leonard@gmail.com"
#[3] "jcdavola31@gmail.com"
#[4] "andrescarnederes@headset.cl"
#[5] "luciana.chavela.ecuador@gmail.com"

Or, since the OP asked for a solution with gsub/sub:

sub("(.*\\s+|^)(\\S+@\\S+).*", "\\2", a)
#[1] "Saolonm@hotmail.com" 
#[2] "26.leonard@gmail.com" 
#[3] "jcdavola31@gmail.com"             
#[4] "andrescarnederes@headset.cl"  
#[5] "luciana.chavela.ecuador@gmail.com"
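
One plausible reading of the unchanged output in the question is that gsub() rewrites only the part of the string that matched, so replacing a match with its own capture group returns the input as it was. The sub() call above avoids this by matching the whole string around the address. The following base R sketch illustrates the difference, assuming a hypothetical two-element sample and a simplified POSIX character class (pat) in place of the question's regex.

```r
# Hypothetical sample in the same shape as the question's data
a <- c("por favor a: Saolonm@hotmail.com",
       "mi correo es jcdavola31@gmail.com , Muchas Gracias")

# Simplified pattern (an assumption, not a full email grammar)
pat <- "[[:alnum:]._-]+@[[:alnum:].-]+"

# Replacing each match with itself leaves the strings unchanged
unchanged <- gsub(paste0("(", pat, ")"), "\\1", a)
identical(unchanged, a)
# TRUE

# Matching the whole string and keeping only the group extracts the address;
# perl = TRUE is needed here for the lazy quantifier .*?
via_sub <- sub(paste0("^.*?(", pat, ").*$"), "\\1", a, perl = TRUE)

# regexpr() plus regmatches() returns the first match of each element directly
via_regmatches <- regmatches(a, regexpr(pat, a))
via_regmatches
# "Saolonm@hotmail.com"  "jcdavola31@gmail.com"
```

The two approaches differ on strings with no address: sub() returns such a string unchanged, while regmatches() with regexpr() drops it from the result.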

Source: https://stackoverflow.com/questions/37681197/r-gsub-to-extract-emails-from-text

Tags

regex

gsub