conditional string splitting in R (using tidyr)

问题

I have a data frame like this:

X <- data.frame(value = c(1,2,3,4), 
                variable = c("cost", "cost", "reed_cost", "reed_cost"))

I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)

If my data were something nicer, say:

Y <- data.frame(value = c(1,2,3,4), 
                variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))

Then this is trivial with tidyr:

separate(Y, variable, c("Type", "Model"), "_")

and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").

I tried:

separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)

but no luck. I realize I cannot even split to an empty string successfully:

separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)

how should I do this?

Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.

I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.

I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...

回答1:

Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?

Example:

library(stringr)
X <- data.frame(value = c(1,2,3,4), 
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")

回答2:

You could try:

X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))

 separate(X, variable, c("Policy-cost", "Reed"), "_")
 # value Policy-cost Reed
 #1     1             cost
 #2     2             cost
 #3     3        reed cost
 #4     4        reed cost

X$variable <-  gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
 X$variable
#[1] "_cost"     "_cost"     "reed_cost" "reed_cost"

 separate(X, variable, c("Policy-cost", "Reed"), "_")

Explanation

\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for characters followed by word boundary. The third and fourth elements does not match, so it was not replaced.

回答3:

Another approach with base R:

cbind(X["value"], 
      setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"), 
                                      function(x) 
                                        if (length(x) == 1) c("", x) 
                                        else x))), 
               c("Policy-cost", "Reed")))

#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost

来源：https://stackoverflow.com/questions/24939601/conditional-string-splitting-in-r-using-tidyr

标签

regex

tidyr