R Extract a word from a character string using pattern matching

别等时光非礼了梦想. 提交于 2019-12-04 22:55:29

Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string (?<=^| ), and then as few characters as possible .*? until the next space or the end of the string (?=$| ). Note that you can combine all the strings into a vector and a vector will be returned. If you want all matches for each string, then use str_extract_all instead of str_extract and you get a list with a vector for each string. If you want to specify multiple matches, use an option and a capturing group (AA|BB) as shown.

mytext <- c(
  as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
  as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100,
  as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
  as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
  as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
  as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)

library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100"   "AA3273"  "AA5785"  NA        "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#> 
#> [[2]]
#> [1] "AA100"
#> 
#> [[3]]
#> [1] "AA3273"
#> 
#> [[4]]
#> [1] "AA5785"
#> 
#> [[5]]
#> character(0)
#> 
#> [[6]]
#> [1] "AA12345" "AA5785"

as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"

Created on 2018-04-29 by the reprex package (v0.2.0).

You can get a base R solution using sub

sub(".*\\b(AA\\w*).*", "\\1", mytext1)
[1] "AA12345"
> sub(".*\\b(AA\\w*).*", "\\1", mytext2)
[1] "AA100"

I like keeping things in base R whenever possible, and there is already a solution for this. What you really are looking for is the regmatches() function. See Here

Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.

To solve your specific problem

matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl=T)
regmatches(mytext1, matches)
> [1] "AA12345"

When there is no match:

matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl=T)
regmatches(mytext3, matches)
> character(0)

If you want to avoid character(0) put your strings in a vector and run them all at once.

alltext = c(mytext1, mytext2, mytext3)
matches = regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T)
regmatches(alltext, matches)
> [1] "AA12345" "AA100"

And finally, if you want a one-liner

regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T))
> [1] "AA12345" "AA100"
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!