How to create conditional dummies “before the event” with dplyr in R?

Question

I'm trying to create a conditional dummy (X) with the rule:

set X = 1 if Y = 1 in the last two years before the NA (only count once!).

To give an example: this is a sample from my data:

year    country Y
1990    Bahamas 1
1991    Bahamas NA
1992    Bahamas NA
1993    Bahamas 0
1994    Bahamas 1
1995    Bahamas 1
1996    Bahamas NA
1997    Bahamas 1
1998    Bahamas NA
1999    Bahamas 1
2000    Bahamas NA
2001    Bahamas 1
2002    Bahamas 1
2003    Bahamas 0
2004    Bahamas NA
2005    Bahamas 0
2006    Bahamas 0
2007    Bahamas 1
2008    Bahamas NA
2009    Bahamas 1
2010    Bahamas 1
2011    Bahamas 1

And here is what the X dummy should look like:

year    country Y   X1
1990    Bahamas 1   1
1991    Bahamas NA  0
1992    Bahamas NA  0
1993    Bahamas 0   0
1994    Bahamas 1   1
1995    Bahamas 1   0
1996    Bahamas NA  0
1997    Bahamas 1   1
1998    Bahamas NA  0
1999    Bahamas 1   1
2000    Bahamas NA  0
2001    Bahamas 1   1
2002    Bahamas 1   0
2003    Bahamas 0   0
2004    Bahamas NA  0
2005    Bahamas 0   0
2006    Bahamas 0   0
2007    Bahamas 1   1
2008    Bahamas NA  0
2009    Bahamas 1   0
2010    Bahamas 1   0
2011    Bahamas 1   0

This is a bit too complicated for me. I've been reading about dplyr, which seems to be a relevant package here. My reading has so far taken me to this code:

df %>% mutate(X=ifelse(Y >0) & lag(Y,2,))

I get the error:

argument "yes" is missing, with no default

Please tell me what I am doing wrong here. Should I put the "ifelse" before the "lag" as well?

Thanks.

Answer 1:

library(dplyr)

dat <- readr::read_table(
"year    country Y
1990    Bahamas 1
1991    Bahamas NA
1992    Bahamas NA
1993    Bahamas 0
1994    Bahamas 1
1995    Bahamas 1
1996    Bahamas NA
1997    Bahamas 1
1998    Bahamas NA
1999    Bahamas 1
2000    Bahamas NA
2001    Bahamas 1
2002    Bahamas 1
2003    Bahamas 0
2004    Bahamas NA
2005    Bahamas 0
2006    Bahamas 0
2007    Bahamas 1
2008    Bahamas NA
2009    Bahamas 1
2010    Bahamas 1
2011    Bahamas 1
")

expected_output <- readr::read_table(
"year    country Y   X1
1990    Bahamas 1   1
1991    Bahamas NA  0
1992    Bahamas NA  0
1993    Bahamas 0   0
1994    Bahamas 1   1
1995    Bahamas 1   0
1996    Bahamas NA  0
1997    Bahamas 1   1
1998    Bahamas NA  0
1999    Bahamas 1   1
2000    Bahamas NA  0
2001    Bahamas 1   1
2002    Bahamas 1   0
2003    Bahamas 0   0
2004    Bahamas NA  0
2005    Bahamas 0   0
2006    Bahamas 0   0
2007    Bahamas 1   1
2008    Bahamas NA  0
2009    Bahamas 1   0
2010    Bahamas 1   0
2011    Bahamas 1   0
")

Identify the runs of rows that end with NA, find the position of the first 1 in the Y column within each run, and create the X1 column with a 1 at each position found:

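res <-
  dat %>% 
  group_by(country) %>% 
  # start a new run after every NA; `.add = TRUE` keeps the country grouping
  # (dplyr >= 1.0.0; older versions spell it `add = TRUE`)
  group_by(grp = cumsum(is.na(lag(Y))), .add = TRUE) %>% 
  # position of the first 1 in the run, kept only when the run contains an NA
  # and has a 1 among its last three rows
  mutate(first_year_at_1 = match(1, Y) * any(is.na(Y)) * any(tail(Y, 3) == 1L), 
         X1 = {x <- integer(length(Y)) ; x[first_year_at_1] <- 1L ; x}) %>% 
  ungroup()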

all.equal(select(res, -grp, -first_year_at_1), expected_output)

# [1] TRUE

(Note: if the real dataset contains several countries, you will want to group by country first so that a run cannot spill over the boundary between two countries. I edited my answer accordingly.)
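
To make the country caveat concrete, here is a minimal base-R sketch. The two-country toy data and the helper names run_id and flag_first_one are hypothetical, and the flagging rule is the simplified one (first 1 of a run that ends with NA), without the `tail(Y, 3)` window check used above.

```r
# Toy data (hypothetical): country A has no NA, country B ends with one
country <- c("A", "A", "B", "B")
Y       <- c(1, 1, 1, NA)

# Run id: a new run starts whenever the previous value is NA
# (base-R stand-in for cumsum(is.na(lag(Y))))
run_id <- function(y) cumsum(is.na(c(NA, head(y, -1))))

# Flag the first 1 of a run, but only if the run ends with NA
flag_first_one <- function(y) {
  x <- integer(length(y))
  pos <- match(1, y)                       # NA when the run holds no 1
  if (!is.na(pos) && is.na(y[length(y)])) x[pos] <- 1L
  x
}

# Runs computed over the whole column: A's rows are merged into B's run
X1_pooled <- ave(Y, run_id(Y), FUN = flag_first_one)

# Runs computed within each country, then combined into one grouping key
key <- interaction(country, ave(Y, country, FUN = run_id), drop = TRUE)
X1_by_country <- ave(Y, key, FUN = flag_first_one)

X1_pooled       # 1 0 0 0  (country A wrongly flagged)
X1_by_country   # 0 0 1 0
```

Without the country grouping, the flag lands on a row of country A even though A never has an NA; with it, only country B is flagged.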

Answer 2:

A solution can be found using the dplyr package. The approach is to create groups that each end with an NA. Within a group, X1 is set to 1 on the first row where Y == 1, provided the group's last Y is NA; otherwise X1 is set to 0.

library(dplyr)

df %>% group_by(Grp = cumsum(is.na(lag(Y))))  %>%
  mutate(X1 = ifelse(row_number()== min(which(Y==1)) & is.na(last(Y)) , 1, 0 )) %>%
  ungroup() %>%
  select(-Grp) %>%
  as.data.frame()


#    year country  Y X1
# 1  1990 Bahamas  1  1
# 2  1991 Bahamas NA  0
# 3  1992 Bahamas NA  0
# 4  1993 Bahamas  0  0
# 5  1994 Bahamas  1  1
# 6  1995 Bahamas  1  0
# 7  1996 Bahamas NA  0
# 8  1997 Bahamas  1  1
# 9  1998 Bahamas NA  0
# 10 1999 Bahamas  1  1
# 11 2000 Bahamas NA  0
# 12 2001 Bahamas  1  1
# 13 2002 Bahamas  1  0
# 14 2003 Bahamas  0  0
# 15 2004 Bahamas NA  0
# 16 2005 Bahamas  0  0
# 17 2006 Bahamas  0  0
# 18 2007 Bahamas  1  1
# 19 2008 Bahamas NA  0
# 20 2009 Bahamas  1  0
# 21 2010 Bahamas  1  0
# 22 2011 Bahamas  1  0
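
To see what the grouping key does, the sketch below traces it on the sample Y column in base R. It is only an illustration: `c(NA, head(Y, -1))` stands in for dplyr's `lag(Y)`, and the variable names are my own.

```r
# Y column of the sample data, 1990 to 2011
Y <- c(1, NA, NA, 0, 1, 1, NA, 1, NA, 1, NA,
       1, 1, 0, NA, 0, 0, 1, NA, 1, 1, 1)

# Previous value of Y (base-R equivalent of lag(Y))
prev <- c(NA, head(Y, -1))

# A new group starts on every row whose previous value is NA
grp <- cumsum(is.na(prev))
grp
# 1 1 2 3 3 3 3 4 4 5 5 6 6 6 6 7 7 7 7 8 8 8

# Inspect each group: does it end with NA, and where is its first 1?
runs         <- split(Y, grp)
ends_with_na <- vapply(runs, function(y) is.na(y[length(y)]), logical(1))
first_one    <- vapply(runs, function(y) match(1, y), integer(1))

ends_with_na   # TRUE for groups 1 to 7, FALSE for group 8
first_one      # 1 NA 2 1 1 1 3 1
```

Group 2 holds a single NA and therefore has no 1 to flag, and group 8 (2009 to 2011) does not end with NA, which is why those rows get X1 = 0 in the output above.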

Data:

df <- read.table(text = 
"year    country Y
1990    Bahamas 1
1991    Bahamas NA
1992    Bahamas NA
1993    Bahamas 0
1994    Bahamas 1
1995    Bahamas 1
1996    Bahamas NA
1997    Bahamas 1
1998    Bahamas NA
1999    Bahamas 1
2000    Bahamas NA
2001    Bahamas 1
2002    Bahamas 1
2003    Bahamas 0
2004    Bahamas NA
2005    Bahamas 0
2006    Bahamas 0
2007    Bahamas 1
2008    Bahamas NA
2009    Bahamas 1
2010    Bahamas 1
2011    Bahamas 1",
header = TRUE, stringsAsFactors = FALSE)
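
As for the error in the question: `ifelse()` takes three arguments (test, yes, no), and `ifelse(Y > 0)` supplies only the test, hence the message about "yes" being missing. A minimal illustration on a hypothetical four-element vector:

```r
# Toy vector (hypothetical), including one missing value
Y <- c(1, NA, 0, 1)

# All three arguments supplied: test, value if TRUE, value if FALSE.
# An NA in the test stays NA in the result.
X <- ifelse(Y > 0, 1, 0)
X    # 1 NA 0 1

# If a missing Y should count as 0, make the test NA-free first
X0 <- ifelse(!is.na(Y) & Y > 0, 1, 0)
X0   # 1 0 0 1
```

This only fixes the call itself; the "first 1 before an NA" rule still needs the grouping step shown above.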

Source: https://stackoverflow.com/questions/50556268/how-to-create-conditional-dummies-before-the-event-with-dplyr-in-r

Tags

dplyr

data.table

plyr