问题
I have a complicated question that I will try to simplify by simplifying my dataset. Say I have 5 variables:
df$Id <- c(1:12)
df$Date <- c(NA,NA,a,a,b,NA,NA,b,c,c,b,a)
df$va <- c(1.1, 1.4, 2.5, ...) #12 randoms values
df$vb <- c(5.9, 2.3, 4.7, ...) #12 other random values
df$vc <- c(3.0, 3.3, 3.7, ...) #12 more random values
Then I want to create a new variable that takes the value from va, vb, or vc if the date is equal to a, b, or c. I had tried a nested if-else, which did not work. I also tried:
df$new[df$date=='a' & !is.na(df$date)] <- df$va
df$new[df$date=='b' & !is.na(df$date)] <- df$vb
df$new[df$date=='c' & !is.na(df$date)] <- df$vc
This correctly left NA's in the new variable where Date=NA, however the values provided were not from va, vb, or vc, but some other value altogether. How can I get df$new to equal va if the date is 'a', vb if the date is 'b', and vc if the date is 'c'?
回答1:
Try
library(dplyr)
df %>%
mutate(new = (Date=="a")*va + (Date=="b")*vb + (Date=="c")*vc)
# Id Date va vb vc new
#1 1 <NA> 0.26550866 0.6870228 0.26722067 NA
#2 2 <NA> 0.37212390 0.3841037 0.38611409 NA
#3 3 a 0.57285336 0.7698414 0.01339033 0.5728534
#4 4 a 0.90820779 0.4976992 0.38238796 0.9082078
#5 5 b 0.20168193 0.7176185 0.86969085 0.7176185
#6 6 <NA> 0.89838968 0.9919061 0.34034900 NA
#7 7 <NA> 0.94467527 0.3800352 0.48208012 NA
#8 8 b 0.66079779 0.7774452 0.59956583 0.7774452
#9 9 c 0.62911404 0.9347052 0.49354131 0.4935413
#10 10 c 0.06178627 0.2121425 0.18621760 0.1862176
#11 11 b 0.20597457 0.6516738 0.82737332 0.6516738
#12 12 a 0.17655675 0.1255551 0.66846674 0.1765568
Or,
library(data.table)
setDT(df)[,new:= (Date=="a")*va + (Date=="b")*vb + (Date=="c")*vc,]
Data
set.seed(1)
df <- data.frame(Id = 1:12,
Date = c(NA,NA,"a","a","b",NA,NA,"b","c","c","b","a"),
va = runif(12),
vb = runif(12),
vc = runif(12), stringsAsFactors = FALSE)
回答2:
You may also do row/column
indexing using base R
(using the data from @ExperimenteR). Even if there are 100's of unique
'Date' with corresponding 'v' columns, we may not need to change the code especially the cbind(..)
part.
df$new <- df[-(1:2)][cbind(1:nrow(df),match(df$Date, sort(unique(df$Date))))]
df
# Id Date va vb vc new
#1 1 <NA> 0.26550866 0.6870228 0.26722067 NA
#2 2 <NA> 0.37212390 0.3841037 0.38611409 NA
#3 3 a 0.57285336 0.7698414 0.01339033 0.5728534
#4 4 a 0.90820779 0.4976992 0.38238796 0.9082078
#5 5 b 0.20168193 0.7176185 0.86969085 0.7176185
#6 6 <NA> 0.89838968 0.9919061 0.34034900 NA
#7 7 <NA> 0.94467527 0.3800352 0.48208012 NA
#8 8 b 0.66079779 0.7774452 0.59956583 0.7774452
#9 9 c 0.62911404 0.9347052 0.49354131 0.4935413
#10 10 c 0.06178627 0.2121425 0.18621760 0.1862176
#11 11 b 0.20597457 0.6516738 0.82737332 0.6516738
#12 12 a 0.17655675 0.1255551 0.66846674 0.1765568
回答3:
I was told the problem with my code is that I needed to put indexing on either side. Without the indexing on the right side, it does not know which row to apply the value from. So the correct code in this case would be:
df$new[df$date=='a' & !is.na(df$date)] <- df$va[df$date=='a' & !is.na(df$date)]
df$new[df$date=='b' & !is.na(df$date)] <- df$vb[df$date=='b' & !is.na(df$date)]
df$new[df$date=='c' & !is.na(df$date)] <- df$vc[df$date=='c' & !is.na(df$date)]
Alternatively, another user noted there is a way to use ifelse, which can be viewed as the correct answer here: https://stats.stackexchange.com/questions/151345/how-to-create-a-new-variable-with-values-from-different-variables-if-another-var
As I added to his answer at that link, I found what worked better was to replace the == with %in%, so that it created a numeric variable instead of a list with a row for each of the 36121 observations in my dataset (12 in the example I provided). That would look like:
df$new[df$date %in% 'a' & !is.na(df$date)] <- df$va[df$date %in% 'a' & !is.na(df$date)]
df$new[df$date %in% 'b' & !is.na(df$date)] <- df$vb[df$date %in% 'b' & !is.na(df$date)]
df$new[df$date %in% 'c' & !is.na(df$date)] <- df$vc[df$date %in% 'c' & !is.na(df$date)]
来源:https://stackoverflow.com/questions/30113308/how-to-create-a-new-variable-with-values-from-different-variables-if-another-var