Regular expression for the “opposite” result

Submitted by ☆樱花仙子☆ on 2019-12-24 01:47:15

Question


Take the following character vector x

x <- c("1     Date in the form", "2     Number of game", 
       "3     Day of week", "4-5     Visiting team and league")

My desired result is the following vector, with the first capitalized word from each string and, if the string contains a -, also the last word.

[1] "Date"     "Number"   "Day"      "Visiting" "league"  

So instead of doing

unlist(sapply(strsplit(x, "[[:blank:]]+|, "), function(y){
   if(grepl("[-]", y[1])) c(y[2], tail(y,1)) else y[2] 
}))

to get the result, I figured I could try to shorten it to a regular expression. The result I want is almost the "opposite" of what sub returns with the regular expression below. I've tried it every which way to get the opposite, with different varieties of [^A-Za-z]+ among others, and haven't been successful.

> sub("[A-Z][a-z]+", "", x)
[1] "1      in the form"       "2      of game"           
[3] "3      of week"           "4-5      team and league"

So I guess this is a two part question.

  1. with sub() or gsub(), how can I return the opposite of "[A-Z][a-z]+"?

  2. How can I write the regular expression to read like "Match the first capitalized word and, if the string contains a -, also match the last word."?


Answer 1:


Here are some suggestions:

  1. To extract the first capitalized word with sub, you can use

    sub(".*\\b([A-Z].*?)\\b.*", "\\1", x)
    #[1] "Date"     "Number"   "Day"      "Visiting"
    

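    where \\b represents a word boundary and .*? is a lazy (non-greedy) match, so the capture stops at the end of the first capitalized word.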

  2. You can also extract all words with a single sub command, but an extra step is needed afterwards, because the vector returned by sub always has the same length as the input vector x.

    The following regular expression uses a lookahead ((?=.*-)) to test whether the string contains a -. If it does, two words are extracted: the first capitalized word and the last word. If it does not, the alternative after the logical or (|) is applied and returns the first capitalized word only.

    res <- sub("(?:(?=.*-).*\\b([A-Z].*?\\b ).*\\b(.+)$)|(?:.*\\b([A-Z].*?)\\b.*)", 
               "\\1\\2\\3", x, perl = TRUE)
    # [1] "Date"            "Number"          "Day"             "Visiting league"
    

    One additional step is necessary in order to separate multiple words in the same string:

    unlist(strsplit(res, " ", fixed = TRUE))
    # [1] "Date"     "Number"   "Day"      "Visiting" "league"  
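
The two ideas behind item 2 can be checked in isolation. The following is only a small sketch; has_dash is a name introduced here for illustration and does not appear in the answer. It shows that the lookahead merely tests for a - without consuming any characters, and that sub always returns exactly one string per input element, which is why the strsplit step is needed.

```r
x <- c("1     Date in the form", "2     Number of game",
       "3     Day of week", "4-5     Visiting team and league")

# The lookahead (?=.*-) consumes no characters; it only tests whether a "-"
# occurs somewhere after the current position (here: the start of the string).
# has_dash is an illustrative name, not part of the original answer.
has_dash <- grepl("^(?=.*-)", x, perl = TRUE)
has_dash
# [1] FALSE FALSE FALSE  TRUE

# sub() returns one string per input element, so the two words extracted
# from the last element come back joined in a single string.
res <- sub("(?:(?=.*-).*\\b([A-Z].*?\\b ).*\\b(.+)$)|(?:.*\\b([A-Z].*?)\\b.*)",
           "\\1\\2\\3", x, perl = TRUE)
length(res) == length(x)
# [1] TRUE
```

Because the lookahead only asserts, the rest of the first alternative still starts matching at the beginning of the string.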
    



Answer 2:


Here is a solution using three regular expressions.

cap_words <- regmatches(x, regexpr("[A-Z][a-z]+", x))   # capitalised word
last_words <- sub(".*\\s", "", x[grep("-", x)]) # get last word in strings with a dash
c(cap_words, last_words)
# [1] "Date"     "Number"   "Day"      "Visiting" "league" 
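
Note that c(cap_words, last_words) lists all capitalized words first and all last words afterwards. For x this matches the desired output only because the string with the dash is the last one. If the per-string order matters, the same regular expressions can be applied one string at a time. The following is a minimal sketch; extract_words and the second test vector y are hypothetical and not part of the answer.

```r
# extract_words() is a hypothetical helper, not part of the original answer.
# It applies the regular expressions from above to one string at a time,
# so the last word of a dashed string directly follows its capitalized word.
extract_words <- function(x) {
  per_string <- lapply(x, function(s) {
    first_cap <- regmatches(s, regexpr("[A-Z][a-z]+", s))  # first capitalized word
    if (grepl("-", s, fixed = TRUE)) {
      c(first_cap, sub(".*\\s", "", s))                    # also keep the last word
    } else {
      first_cap
    }
  })
  unlist(per_string)
}

x <- c("1     Date in the form", "2     Number of game",
       "3     Day of week", "4-5     Visiting team and league")
extract_words(x)
# [1] "Date"     "Number"   "Day"      "Visiting" "league"

# Made-up input in which the dashed string comes first
y <- c("1-2     Home team and score", "3     Day of week")
extract_words(y)
# [1] "Home"  "score" "Day"
```

With y, the three-line version above would instead return "Home" "Day" "score".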


Source: https://stackoverflow.com/questions/24670038/regular-expression-for-the-opposite-result
