R regex find last occurrence of delimiter

情深已故 2020-12-06 13:58

I'm trying to get the ending of email addresses (i.e. .net, .com, .edu, etc.), but the portion after the @ can contain multiple periods.

library(stringi)

stri         


        
4 answers
  • 2020-12-06 14:20

    Here are a few approaches. The first seems particularly straightforward and the second particularly short.

    1) sub This can be done with one application of sub in R per column:

    data.frame(X1 = sub("@.*", "", strings1), 
               X2 = sub(".*@", "", strings1), 
               X3 = sub(".*[.]", "", strings1), 
               stringsAsFactors = FALSE)
    

    giving:

        X1            X2  X3
    1 test       aol.com com
    2 test   hotmail.com com
    3 test    xyz.rr.edu edu
    4 test abc.xx.zz.net net
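
The X3 column relies on greedy matching, which a minimal sketch can make visible. The question's own input is truncated, so the `strings1` vector below is an assumption reconstructed from the printed results, and the names `tld` and `after_first_dot` are illustrative only.

```r
# Sample data reconstructed from the printed results (an assumption).
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# ".*" is greedy, so ".*[.]" consumes everything up to and including
# the LAST dot; replacing that with "" leaves only the final label.
tld <- sub(".*[.]", "", strings1)
print(tld)
# "com" "com" "edu" "net"

# For contrast: a lazy ".*?" stops at the FIRST dot instead.
after_first_dot <- sub(".*?[.]", "", strings1, perl = TRUE)
print(after_first_dot)
# "com" "com" "rr.edu" "xx.zz.net"
```

The contrast with the lazy quantifier shows why no explicit "last occurrence" logic is needed here: greediness already pushes the match to the final dot.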
    

    2) strapplyc Here is an alternative using the gsubfn package that is particularly short. It returns a character matrix. strapplyc returns the matches to the portions of the pattern in parentheses. The first set of parentheses matches everything before the @, the second set matches everything after the @, and the last set (nested inside the second) matches everything after the last dot.

    library(gsubfn)
    pat <- "(.*)@(.*[.](.*))"
    t(strapplyc(strings1, pat, simplify = TRUE))
    
         [,1]   [,2]            [,3] 
    [1,] "test" "aol.com"       "com"
    [2,] "test" "hotmail.com"   "com"
    [3,] "test" "xyz.rr.edu"    "edu"
    [4,] "test" "abc.xx.zz.net" "net"
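
If gsubfn is not available, the same capture groups can probably be pulled out with base R's regexec and regmatches. This is a sketch of a possible analogue, not part of the gsubfn answer; `strings1` is an assumption reconstructed from the printed results, and `m` and `parts` are illustrative names.

```r
# Sample data reconstructed from the printed results (an assumption).
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Same pattern as in (2): three capture groups, the third nested in the second.
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each group; regmatches() extracts them
# as a list with one character vector (full match, group 1..3) per input.
m <- regmatches(strings1, regexec(pat, strings1))

# Stack the vectors into a matrix and drop column 1 (the full match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]
print(parts)
```

Note that an input that does not match the pattern yields a zero-length vector, which rbind drops, so the rows would no longer line up with the inputs.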
    

    2a) read.pattern read.pattern also in the gsubfn package can do it using the same pat defined in (2):

    library(gsubfn)
    pat <- "(.*)@(.*[.](.*))"
    read.pattern(text = strings1, pat, as.is = TRUE)
    

    giving a data.frame similar to (1) except the column names are V1, V2 and V3.

    3) strsplit The overlapping extractions make it difficult to do with strsplit but we can do it with two applications of strsplit. The first strsplit splits at the @ and the second uses everything up to the last dot to split on. This last strsplit always produces an empty string as the first split string and we delete this using [, -1]. This gives a character matrix:

     ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
     cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )
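
To see where the empty string comes from, it may help to look at a single split. In this small sketch `strings1` is an assumption reconstructed from the printed results, and `pieces` and `tld` are illustrative names.

```r
# Sample data reconstructed from the printed results (an assumption).
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# The separator ".*[.]" matches from the start of the string through the
# last dot, so the piece "before" the separator is always empty.
pieces <- strsplit("test@xyz.rr.edu", ".*[.]")[[1]]
print(pieces)
# "" "edu"

# Helper from the answer: split every element and stack the pieces row-wise.
ss <- function(x, pat) do.call(rbind, strsplit(x, pat))

# Dropping the first (empty) column leaves just the part after the last dot.
tld <- ss(strings1, ".*[.]")[, -1]
print(tld)
# "com" "com" "edu" "net"
```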
    

    giving the same answer as (2).

    4) strsplit/sub This is a mix of (1) and (3):

    cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))
    

    giving the same answer as (2).

    4a) This is another way to use strsplit and sub. Here we append a @ followed by the TLD and then split on @.

    do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))
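
The intermediate string in (4a) is easier to follow when printed on its own. A minimal sketch on one address taken from the printed results; `tagged` and `fields` are illustrative names.

```r
# One address taken from the printed results above.
x <- "test@xyz.rr.edu"

# Group 1 is the whole string, group 2 the part after the last dot,
# so the replacement appends "@" plus that final part.
tagged <- sub("(.*[.](.*))", "\\1@\\2", x)
print(tagged)
# "test@xyz.rr.edu@edu"

# Splitting on "@" now yields all three fields at once.
fields <- strsplit(tagged, "@")[[1]]
print(fields)
# "test" "xyz.rr.edu" "edu"
```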
    

    giving the same answer as (2).

    Update: Added additional solutions.

  • 2020-12-06 14:25

    This is a negative lookahead regex that should give you the last .word of the line: it matches a dot only when no other dot follows it.

    \.(?!.*\.)\w+       
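
In R the pattern above needs doubled backslashes and perl = TRUE, because lookaheads are a PCRE feature. A minimal sketch of one way to apply it; `strings1` is an assumption reconstructed from the results printed in the other answers, and `last_label` and `tld` are illustrative names.

```r
# Sample data reconstructed from the printed results (an assumption).
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# The answer's pattern as an R string literal (backslashes doubled).
pat <- "\\.(?!.*\\.)\\w+"

# regexpr() finds the first match per string; perl = TRUE enables lookahead.
last_label <- regmatches(strings1, regexpr(pat, strings1, perl = TRUE))
print(last_label)
# ".com" ".com" ".edu" ".net"

# Strip the leading dot to keep only the ending itself.
tld <- sub("^\\.", "", last_label)
print(tld)
# "com" "com" "edu" "net"
```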
    
  • 2020-12-06 14:30

    A solution using basic regex, assuming df1$X2 is a character vector:

    df1 <- cbind(df1, X3 = regmatches(df1$X2, regexpr('\\.[A-Za-z]+$', df1$X2)))
    df1$X3 <- gsub("\\.", "", df1$X3)
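
One caveat with this approach: regmatches silently drops elements that do not match, so the result can end up shorter than the input. A hedged sketch of one way to keep the positions aligned, using a made-up vector `x` (not from the question) that includes a value without a dot:

```r
# Hypothetical input; "localhost" has no dot and therefore no match.
x <- c("aol.com", "localhost", "xyz.rr.edu")

# regexpr() returns -1 for elements without a match.
m <- regexpr("\\.[A-Za-z]+$", x)

# Start with NA everywhere, then fill in only the matched positions.
tld <- rep(NA_character_, length(x))
tld[m != -1] <- sub("^\\.", "", regmatches(x, m))
print(tld)
# "com" NA "edu"
```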
    
  • 2020-12-06 14:43

    A read.table + file_ext approach (not regex but pretty easy):

    dat <- read.table(text=strings1, sep="@")
    dat$V3 <- tools::file_ext(strings1)
    dat
    
    ##     V1            V2  V3
    ## 1 test       aol.com com
    ## 2 test   hotmail.com com
    ## 3 test    xyz.rr.edu edu
    ## 4 test abc.xx.zz.net net
    

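    Here's a purely regex approach. Note that it splits at the @ and at the last dot, so the middle column holds the domain without its ending: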

    do.call(rbind, strsplit(strings1, "@|\\.(?=[^\\.]+$)", perl=TRUE))
    
    ##     [,1]   [,2]        [,3] 
    ## [1,] "test" "aol"       "com"
    ## [2,] "test" "hotmail"   "com"
    ## [3,] "test" "xyz.rr"    "edu"
    ## [4,] "test" "abc.xx.zz" "net"
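
The lookahead `(?=[^.]+$)` only accepts a dot that is followed by non-dot characters up to the end of the string, which is what singles out the last dot. If the full domain is still wanted, the pieces can be pasted back together, as in this small sketch on one address from the output above (`parts` and `domain` are illustrative names):

```r
# Split at "@" or at the last dot (a dot followed only by non-dots).
parts <- strsplit("test@abc.xx.zz.net", "@|\\.(?=[^.]+$)", perl = TRUE)[[1]]
print(parts)
# "test" "abc.xx.zz" "net"

# Rebuild the full domain from the middle piece and the ending.
domain <- paste(parts[2], parts[3], sep = ".")
print(domain)
# "abc.xx.zz.net"
```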
    