Create new column from an existing column with pattern matching in R

问题

I'm trying to add a new column based on another using pattern matching. I've read this post, but not getting the desired output.

I want to create a new column (SubOrder) based on the GreatGroup column. I have tried the following:

SubOrder <- rep(NA_character_, length(myData))

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"
SubOrder[grepl("aquults", myData, ignore.case = TRUE)] <-  "Aquults"
SubOrder[grepl("aqualfs", myData, ignore.case = TRUE)] <-  "aqualfs"
SubOrder[grepl("humods", myData, ignore.case = TRUE)] <-  "humods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("orthods", myData, ignore.case = TRUE)] <-  "orthods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("psamments", myData, ignore.case = TRUE)] <-  "psamments"
SubOrder[grepl("udepts", myData, ignore.case = TRUE)] <-  "udepts"
SubOrder[grepl("fluvents", myData, ignore.case = TRUE)] <-  "fluvents"
SubOrder[grepl("aquods", myData, ignore.case = TRUE)] <-  "aquods"

For example, I'm looking for "udults" inside any word, such as Hapludults or Paleudults, and return just "udults".

EDIT: If anyone wants to take a shot at alistaire's comment, this is the search patterns I would use.

 subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents")

Example data below.

myData <- dput(head(test))
structure(list(1:6, SID = c(200502L, 200502L, 200502L, 200502L, 
200502L, 200502L), Groupdepth = c(11L, 12L, 13L, 14L, 21L, 22L
), AWC0to10 = c(0.12, 0.12, 0.12, 0.12, 0.12, 0.12), AWC10to20 = c(0.12, 
0.12, 0.12, 0.12, 0.12, 0.12), AWC20to50 = c(0.12, 0.12, 0.12, 
0.12, 0.12, 0.12), AWC50to100 = c(0.15, 0.15, 0.15, 0.15, 0.15, 
0.15), Db3rdbar0to10 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), 
    Db3rdbar10to20 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar20to50 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar50to100 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), HydrcRatngPP = c(0L, 0L, 0L, 
    0L, 0L, 0L), OrgMatter0to10 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter10to20 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter20to50 = c(1.02, 1.02, 1.02, 1.02, 1.02, 
    1.02), OrgMatter50to100 = c(0.12, 0.12, 0.12, 0.12, 0.12, 
    0.12), Clay0to10 = c(8, 8, 8, 8, 8, 8), Clay10to20 = c(8, 
    8, 8, 8, 8, 8), Clay20to50 = c(9.4, 9.4, 9.4, 9.4, 9.4, 9.4
    ), Clay50to100 = c(40, 40, 40, 40, 40, 40), Sand0to10 = c(85, 
    85, 85, 85, 85, 85), Sand10to20 = c(85, 85, 85, 85, 85, 85
    ), Sand20to50 = c(83, 83, 83, 83, 83, 83), Sand50to100 = c(45.8, 
    45.8, 45.8, 45.8, 45.8, 45.8), pHwater0to20 = c(6.3, 6.3, 
    6.3, 6.3, 6.3, 6.3), Ksat0to10 = c(23, 23, 23, 23, 23, 23
    ), Ksat10to20 = c(23, 23, 23, 23, 23, 23), Ksat20to50 = c(19.7333, 
    19.7333, 19.7333, 19.7333, 19.7333, 19.7333), Ksat50to100 = c(9, 
    9, 9, 9, 9, 9), TaxClName = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults"), GreatGroup = c("Hapludults", 
    "Hapludults", "Hapludults", "Hapludults", "Hapludults", "Hapludults"
    )), .Names = c("", "SID", "Groupdepth", "AWC0to10", "AWC10to20", 
"AWC20to50", "AWC50to100", "Db3rdbar0to10", "Db3rdbar10to20", 
"Db3rdbar20to50", "Db3rdbar50to100", "HydrcRatngPP", "OrgMatter0to10", 
"OrgMatter10to20", "OrgMatter20to50", "OrgMatter50to100", "Clay0to10", 
"Clay10to20", "Clay20to50", "Clay50to100", "Sand0to10", "Sand10to20", 
"Sand20to50", "Sand50to100", "pHwater0to20", "Ksat0to10", "Ksat10to20", 
"Ksat20to50", "Ksat50to100", "TaxClName", "GreatGroup"), class = c("tbl_df", 
"data.frame"), row.names = c(NA, -6L))

回答1:

A few options, some of which I posted in the comments above.

Note: All options assume the replacement for the strings that match patters are just the pattern. If you want something else, they're all easily editable to include separate replacement values.

Option 1: `for` + `grepl`

Using the same code as the original, but looping to avoid repetitive code:

# make a list of patterns
pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods')

SubOrder <- rep(NA_character_, length(myData))

for(x in 1:length(pat)){
  SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <-  pat[x]
}

Option 2: `for` + `gsub`

Build the new column in place by copying myData$GreatGroup and then altering it with gsub. The extra regex pasted on includes characters within the same string.

myData$SubOrder <- myData$GreatGroup
for(x in pat){
  myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)
}

Note that values not matched by one of the strings in pat will have the value from GreatGroup, not NA. If you want them to be NA, fix them with

myData$SubOrder[!(myData$SubOrder %in% pat)] <- NA

Option 3: named list + `stringr::str_replace_all`

My favorite because it doesn't loop, although it requires the stringr package (which is pretty awesome, anyway).

Make a named list from pat, where the name is the regex you want to replace, and the item is the string to match:

l <- as.list(pat)
names(l) <- paste0('.*', pat, '.*')

so it looks like

> l
$`.*udults.*`
[1] "udults"

$`.*aquults.*`
[1] "aquults"

$`.*aqualfs.*`
[1] "aqualfs"
......

Then use str_replace_all to DO IT ALL AT ONCE:

myData$SubOrder <- str_replace_all(myData$GreatGroup, l)

Boom.

Note 1: str_replace_all doesn't have an ignore.case option, but you can wrap myData$GreatGroup in tolower (easy) or reconfigure the regex (hard).

Note 2: Like Option 2, it leaves unmatched entries as the value from GreatGroup, so use the line at the end of that option to go back to NAs, if you like.

回答2:

I'm using dplyr, but you probably need to create a giant nested ifelse statement...

library(dplyr)

myData %>%
  mutate(SubOrder = ifelse(grepl('udults', GreatGroup, ignore.case = T), 'Udults',
                           ifelse(grepl('aquults', GreatGroup, ignore.case = T, 'Aquults',
                                        ###  All of the other ifelse statements
                                        ifelse(grepl('fluvents', GreatGroup, ignore.case = T), 'fluvents', 'aquods')
                           ))))

回答3:

Try this:

myData$SubOrder[grepl("udults", myData$TaxClName, ignore.case = TRUE) | grepl("udults", myData$GreatGroup, ignore.case = TRUE)] <-  "Udults"

You can add as many columns to the filter as you want.

回答4:

You could do this with a function that successively substitutes each pattern, which avoids repeating your code over and over. Note that with this approach, if a given string matches more than one pattern, the first pattern in the substitution sequence will be the one that gets used.

# multi-grepl function adapted from http://stackoverflow.com/a/15254254/496488
mgrepl <- function(pattern, replacement, x, ...) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in 1:length(pattern)) {
    result[grepl(pattern[i], result, ...)] = replacement[i]
  }
  result
}

# Patterns and replacements
pat = c("udults","aquults","humods","fluvents")
repl = c("Udults","Aquults","humods","fluvents")

SubOrder =  mgrepl(pat, repl, myData$GreatGroup)

SubOrder

[1] "Udults" "Udults" "Udults" "Udults" "Udults" "Udults"

# Or, if you want to add this as a new column to the data:
myData$SubOrder = mgrepl(pat, repl, myData$GreatGroup)

One additional note: One issue with the code in your question is that you referenced the whole data frame, rather than the column you want to substitute:

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"

should be changed to

SubOrder[grepl("udults", myData$GreatGroup, ignore.case = TRUE)] <-  "Udults"

UPDATE: Regarding your comment, see the code below. The function does replace both values with "Udults".

myData$GreatGroup[1] = "Paleudults"

myData$GreatGroup

[1] "Paleudults" "Hapludults" "Hapludults" "Hapludults" "Hapludults" "Hapludults"

mgrepl(pat, repl, myData$GreatGroup)

[1] "Udults" "Udults" "Udults" "Udults" "Udults" "Udults"

来源：https://stackoverflow.com/questions/35233744/create-new-column-from-an-existing-column-with-pattern-matching-in-r

标签

regex

grep

pattern-matching