R Extract a word from a character string using pattern matching

Question

I need some help with pattern matching in R. I need to extract a whole word that starts with a common prefix, from a long character string. The word I want to extract always starts with the same prefix (AA), but the word is not the same length, and does not occur in the same location of the string.

mytext1 <- as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH") # Return AA12345

mytext2 <- as.character("ELEPHANT AA100 KOALA POLAR.BEAR") # Want to return AA100

mytext3 <- as.character("CROCODILE DRAGON.FLY ANTELOPE") # Want to return NA

As an extension of this, what if there were two different patterns to match and I wanted to return a character string with both?

mytext4 <- as.character("TULIP AA999 DAISY BB123") 
# Pattern matching to AA and BB 
# Want to return AA999 BB123

Any help with this would be greatly appreciated :)

Answer 1:

Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string, (?<=^| ), followed by as few characters as possible, .*?, up to the next space or the end of the string, (?=$| ). You can combine all the strings into a vector, and a vector will be returned. If you want all matches for each string, use str_extract_all instead of str_extract; you then get a list with one character vector per string. To match several prefixes, use alternation inside a group, (AA|BB), as shown.

mytext <- c(
  as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
  as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100
  as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
  as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
  as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
  as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)

library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100"   "AA3273"  "AA5785"  NA        "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#> 
#> [[2]]
#> [1] "AA100"
#> 
#> [[3]]
#> [1] "AA3273"
#> 
#> [[4]]
#> [1] "AA5785"
#> 
#> [[5]]
#> character(0)
#> 
#> [[6]]
#> [1] "AA12345" "AA5785"

as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"

Created on 2018-04-29 by the reprex package (v0.2.0).

Answer 2:

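You can get a base R solution using sub(). Note that sub() returns the input unchanged when the pattern does not match, so on its own this does not return NA for mytext3.

sub(".*\\b(AA\\w*).*", "\\1", mytext1)
#> [1] "AA12345"
sub(".*\\b(AA\\w*).*", "\\1", mytext2)
#> [1] "AA100"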
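
One way to get NA for strings without a match is to test with grepl() first and only then extract with sub(). This is a minimal sketch, not part of the original answer; the helper name extract_aa is hypothetical, and it assumes the wanted word consists of word characters only (letters, digits, underscore).

```r
# Hypothetical helper: return the AA-word found by sub(), or NA if there is none.
extract_aa <- function(x) {
  pattern <- ".*\\b(AA\\w*).*"
  # grepl() flags the strings that contain a match; the rest become NA.
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

res_sub <- extract_aa(c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH",
  "ELEPHANT AA100 KOALA POLAR.BEAR",
  "CROCODILE DRAGON.FLY ANTELOPE"
))
print(res_sub)
#> [1] "AA12345" "AA100"   NA
```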

Answer 3:

I like keeping things in base R whenever possible, and there is already a solution for this: what you are really looking for is the regmatches() function. From its documentation (?regmatches):

Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.

To solve your specific problem:

matches <- regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl = TRUE)
regmatches(mytext1, matches)
#> [1] "AA12345"

When there is no match:

matches <- regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl = TRUE)
regmatches(mytext3, matches)
#> character(0)

If you want to avoid character(0), put your strings in a vector and run them all at once. Note that regmatches() drops the elements that did not match, so the result can be shorter than the input.

alltext <- c(mytext1, mytext2, mytext3)
matches <- regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE)
regmatches(alltext, matches)
#> [1] "AA12345" "AA100"
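
If the result should stay aligned with the input, with NA where nothing matched, one option is to start from a vector of NAs and fill in only the positions where regexpr() found a match. This is a sketch under that assumption, not part of the original answer; the helper name extract_first is hypothetical.

```r
# Hypothetical helper: first match per element, NA where there is no match.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1           # regexpr() returns -1 for "no match"
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches() returns the hits only
  out
}

alltext <- c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH",
  "ELEPHANT AA100 KOALA POLAR.BEAR",
  "CROCODILE DRAGON.FLY ANTELOPE"
)
res_first <- extract_first(alltext, "(?<=^| )AA.*?(?=$| )")
print(res_first)
#> [1] "AA12345" "AA100"   NA
```

The output has the same length as the input, so it can be stored as a new column next to the original strings.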

And finally, if you want a one-liner:

regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE))
#> [1] "AA12345" "AA100"
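
For the extension in the question (two prefixes, returned as a single string), gregexpr() finds every match rather than only the first, and the matches can then be pasted together. A minimal sketch, assuming the prefixes AA and BB from the question; the variable names found and combined are only illustrative.

```r
mytext4 <- "TULIP AA999 DAISY BB123"

# Same idea as above, with a non-capturing group for the two prefixes.
pattern <- "(?<=^| )(?:AA|BB).*?(?=$| )"

# gregexpr() returns all matches; regmatches() then gives a list with
# one character vector per input string.
found <- regmatches(mytext4, gregexpr(pattern, mytext4, perl = TRUE))

# Collapse each vector of matches into one space-separated string.
combined <- vapply(found, paste, character(1), collapse = " ")
print(combined)
#> [1] "AA999 BB123"
```

A string with no match yields an empty string "" here, because paste() collapses an empty vector to "". Convert it to NA afterwards if that is what you need.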

Source: https://stackoverflow.com/questions/50053168/r-extract-a-word-from-a-character-string-using-pattern-matching

Tags

regex

pattern-matching