Question
I need some help with pattern matching in R. I need to extract a whole word that starts with a common prefix, from a long character string. The word I want to extract always starts with the same prefix (AA), but the word is not the same length, and does not occur in the same location of the string.
mytext1 <- as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH") # Return AA12345
mytext2 <- as.character("ELEPHANT AA100 KOALA POLAR.BEAR") # Want to return AA100
mytext3 <- as.character("CROCODILE DRAGON.FLY ANTELOPE") # Want to return NA
As an extension of this, what if there were two different patterns to match and I wanted to return a character string with both?
mytext4 <- as.character("TULIP AA999 DAISY BB123")
# Pattern matching to AA and BB
# Want to return AA999 BB123
Any help with this would be greatly appreciated :)
Answer 1:
Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string, (?<=^| ), then as few characters as possible, .*?, up to the next space or the end of the string, (?=$| ). You can combine all the strings into a vector, and a vector is returned. If you want all matches for each string, use str_extract_all instead of str_extract; you then get a list with one vector per string. To match several prefixes, use alternation inside a group, (AA|BB), as shown.
mytext <- c(
as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100
as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)
library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100" "AA3273" "AA5785" NA "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#>
#> [[2]]
#> [1] "AA100"
#>
#> [[3]]
#> [1] "AA3273"
#>
#> [[4]]
#> [1] "AA5785"
#>
#> [[5]]
#> character(0)
#>
#> [[6]]
#> [1] "AA12345" "AA5785"
as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"
Created on 2018-04-29 by the reprex package (v0.2.0).
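The question also asked for a single character string containing both matches ("AA999 BB123"). A minimal sketch of one way to do that follows, written in base R so that it runs without extra packages. The helper name collapse_matches and its prefixes argument are made up for illustration; the pattern is the same lookaround pattern used above.

```r
# Hypothetical helper: collapse every matching word of each string into one string.
collapse_matches <- function(x, prefixes = c("AA", "BB")) {
  # Builds e.g. "(?<=^| )(AA|BB).*?(?=$| )" from the supplied prefixes
  pattern <- paste0("(?<=^| )(", paste(prefixes, collapse = "|"), ").*?(?=$| )")
  # gregexpr() finds all matches; regmatches() returns one character vector per input
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Join the matches with a space; return NA when a string has no match
  vapply(
    hits,
    function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

collapse_matches("TULIP AA999 DAISY BB123")
#> [1] "AA999 BB123"
collapse_matches(c("ELEPHANT AA100 KOALA", "CROCODILE DRAGON.FLY ANTELOPE"))
#> [1] "AA100" NA
```

Returning NA rather than an empty string for non-matching input is a design choice here; it mirrors what str_extract does for a single match.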
Answer 2:
You can get a base R solution using sub():
sub(".*\\b(AA\\w*).*", "\\1", mytext1)
[1] "AA12345"
sub(".*\\b(AA\\w*).*", "\\1", mytext2)
[1] "AA100"
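One thing to watch with sub(): when the pattern does not match, the input comes back unchanged, so a string without an AA word is returned whole rather than as NA. A minimal sketch of a wrapper that returns NA in that case is below. The name extract_with_sub is hypothetical, and perl = TRUE is an assumption added here so that the greedy matching behaves in the usual Perl-style way.

```r
# Hypothetical wrapper around the sub() approach that returns NA when nothing matches.
extract_with_sub <- function(x, prefix = "AA") {
  # Same shape as the pattern above: anything, a word boundary, the word, anything
  pattern <- paste0(".*\\b(", prefix, "\\w*).*")
  matched <- grepl(pattern, x, perl = TRUE)
  # Keep the captured word where the pattern matched, NA elsewhere
  ifelse(matched, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
}

animals <- c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH",
  "CROCODILE DRAGON.FLY ANTELOPE"
)
extract_with_sub(animals)
#> [1] "AA12345" NA
```

Because the leading .* is greedy, a string containing several AA words yields the last one, not the first. Also note that \\w* stops at punctuation, so a word such as AA12.5 would be cut at the dot.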
Answer 3:
I like keeping things in base R whenever possible, and there is already a solution for this: the regmatches() function. Its documentation describes it as follows:
Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.
To solve your specific problem
matches <- regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl = TRUE)
regmatches(mytext1, matches)
#> [1] "AA12345"
When there is no match:
matches <- regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl = TRUE)
regmatches(mytext3, matches)
#> character(0)
If you want to avoid character(0), put your strings in a vector and run them all at once; only the matches are returned.
alltext <- c(mytext1, mytext2, mytext3)
matches <- regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE)
regmatches(alltext, matches)
#> [1] "AA12345" "AA100"
And finally, if you want a one-liner:
regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE))
#> [1] "AA12345" "AA100"
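The vectorised calls return two results for three inputs, so the output can no longer be lined up with the original strings. If you need one result per input, with NA where nothing matched (as the question asks for mytext3), a minimal sketch could look like this. The helper name extract_first is made up for illustration.

```r
# Hypothetical helper: first match per string, NA where the pattern does not match.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  # regexpr() reports -1 for "no match"; as.integer() drops the extra attributes
  found <- as.integer(m) != -1L
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched elements, in input order
  out[found] <- regmatches(x, m)
  out
}

alltext <- c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH",
  "ELEPHANT AA100 KOALA POLAR.BEAR",
  "CROCODILE DRAGON.FLY ANTELOPE"
)
extract_first(alltext, "(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100"   NA
```

The result has the same length as the input, which makes it safe to assign as a new column next to the original strings.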
Source: https://stackoverflow.com/questions/50053168/r-extract-a-word-from-a-character-string-using-pattern-matching