R Create a time sequence as xts index based on two columns in data.frame

匿名 (未验证) 提交于 2019-12-03 10:24:21

问题:

I have a data.frame like below

    soc_sec group_count total_creds group_start  group_end        (chr)       (int)       (dbl)      (date)     (date) 1  AA2105480           5        14.0  2005-01-09 2005-05-16 2  AA2105480           7        17.0  2004-08-26 2004-12-10 3  AB4378973           1         0.0  2004-01-21 2004-05-07 4  AB4990257           2         1.0  2014-09-01 2014-12-14 5  AB7777777           5        12.0  2004-01-21 2005-03-22 6  AB7777777           6        15.0  2004-08-26 2004-12-10 7  AB7777777           5        15.0  2005-01-09 2005-05-12 8  AC4285291           2         3.0  2014-09-01 2014-12-14 9  AC4285291           1         3.0  2015-01-12 2015-04-15 10 AC6039874           9        17.5  2010-01-06 2010-05-06 11 AC6039874           7        16.0  2011-01-05 2011-04-29 12 AC6039874           8        12.5  2010-08-31 2010-12-21 13 AC6039874           7        13.5  2011-08-31 2011-12-21 14 AC6547645           7        18.0  2005-01-09 2005-05-12 15 AC6547645           6        17.0  2004-08-26 2004-12-10 16 AC6547645           1         2.0  2005-04-20 2005-06-01 17 AD1418577           7        13.0  2013-01-09 2013-05-17 18 AD1418577           8        16.0  2013-08-28 2013-12-13 19 AD1418577           6        15.0  2014-01-08 2014-05-05 20 AD1418577           7        13.0  2015-08-26 2015-12-15 

What I'm trying to do is create a column that I can later use as a day-by-day index for an xts object based on the sequence of days between group_start and group_end. I know I'm able to calculate a vector for one column using v <- seq(df$group_start[1], df$group_end[1], by="days") I can even make the requisite repetition of the rows that I could later dplyr::bind_rows(df,v) with:

df$len <- apply(df, 1, function(x){     length(seq(as.Date(x["group_start"]), as.Date(x["group_end"]), by="days"))    }) df <- df[rep(seq_len(nrow(df)), df$len),] 

What I have been unable to do is vectorize this to occur for each row in the data.frame.

Things I've tried that do not work

create_date_vector <- function(x){    flog.debug("id: %s", x["soc_sec"])    seq(as.Date(x["group_start"]), as.Date(x["group_end"]), by = "days")  }  date_vec <- c()  date_vec <- c(date_vec, apply(df, 1, create_date_vector)) 

error with : Error in seq.int(0, to0 - from, by) : wrong sign in 'by' argument

date_vec <- c() for(i in 1:nrow(df)){       date_vec <- c(date_vec, seq(from=as.Date(df$group_start[as.integer(i)]), to=as.Date(df$group_end[as.integer(i)])), by="days")     } 

error with : Error in seq.Date(from = as.Date(ags_df$group_start[as.integer(i)]), to = as.Date(ags_df$group_end[as.integer(i)])) : exactly two of 'to', 'by' and 'length.out' / 'along.with' must be specified

Any help would be greatly appreciated. Thank you.

dput

structure(list(soc_sec = c("AA2105480", "AA2105480", "AB4378973",  "AB4990257", "AB7777777", "AB7777777", "AB7777777", "AC4285291",  "AC4285291", "AC6039874", "AC6039874", "AC6039874", "AC6039874",  "AC6547645", "AC6547645", "AC6547645", "AD1418577", "AD1418577",  "AD1418577", "AD1418577"), group_count = c(5L, 7L, 1L, 2L, 5L,  6L, 5L, 2L, 1L, 9L, 7L, 8L, 7L, 7L, 6L, 1L, 7L, 8L, 6L, 7L),      total_creds = c(14, 17, 0, 1, 12, 15, 15, 3, 3, 17.5, 16,      12.5, 13.5, 18, 17, 2, 13, 16, 15, 13), group_start = structure(c(12792,      12656, 12438, 16314, 12438, 12656, 12792, 16314, 16447, 14615,      14979, 14852, 15217, 12792, 12656, 12893, 15714, 15945, 16078,      16673), class = "Date"), group_end = structure(c(12919, 12762,      12545, 16418, 12864, 12762, 12915, 16418, 16540, 14735, 15093,      14964, 15329, 12915, 12762, 12935, 15842, 16052, 16195, 16784     ), class = "Date")), class = c("tbl_df", "data.frame"), row.names = c(NA,  -20L), .Names = c("soc_sec", "group_count", "total_creds", "group_start",  "group_end")) 

回答1:

I don't know how useful this is more than a month after you found a workable solution, but I had a go at whittling your code down to something a bit more compact.

library(dplyr)  df <- structure(list(soc_sec = c("AA2105480", "AA2105480", "AB4378973",  "AB4990257", "AB7777777", "AB7777777", "AB7777777", "AC4285291",  "AC4285291", "AC6039874", "AC6039874", "AC6039874", "AC6039874",  "AC6547645", "AC6547645", "AC6547645", "AD1418577", "AD1418577",  "AD1418577", "AD1418577"), group_count = c(5L, 7L, 1L, 2L, 5L,  6L, 5L, 2L, 1L, 9L, 7L, 8L, 7L, 7L, 6L, 1L, 7L, 8L, 6L, 7L),      total_creds = c(14, 17, 0, 1, 12, 15, 15, 3, 3, 17.5, 16,      12.5, 13.5, 18, 17, 2, 13, 16, 15, 13), group_start = structure(c(12792,      12656, 12438, 16314, 12438, 12656, 12792, 16314, 16447, 14615,      14979, 14852, 15217, 12792, 12656, 12893, 15714, 15945, 16078,      16673), class = "Date"), group_end = structure(c(12919, 12762,      12545, 16418, 12864, 12762, 12915, 16418, 16540, 14735, 15093,      14964, 15329, 12915, 12762, 12935, 15842, 16052, 16195, 16784     ), class = "Date")), .Names = c("soc_sec", "group_count",  "total_creds", "group_start", "group_end"), class = c("tbl_df",  "data.frame"), row.names = c(NA, -20L))   # Essentially the same as the calc_day_nums() and apply() part of # your solution. It returns an object of class difftime, but that # doesn't seem to cause any problems diffs <- abs(with(df, group_start-group_end))+1  # This will repeat row[i] diffs[i] number of times df.rep <- df[rep(1:nrow(df), times=diffs), ] reps <- rep(diffs, times=diffs)  # Creating the time sequences. Many ways to skin this cat, I suspect. # This is but one dates.l <- apply(   df[colnames(df) %in% c("group_start", "group_end")], 1,    function(x) {         seq(min(as.Date(x)), max(as.Date(x)), by="days")   })  # Converting the list into one long vector. Essentially the same as # unlist(), except it retains the Date class. days <- do.call(c, dates.l)  # Combining the elements by column df.long <- cbind(df.rep, reps, days)  str(df.long)  # dplyr isn't exactly my forte. This is just to convert the output # into the same tbl format as the input library(tibble)  df.long <- as_tibble(df.long) 


回答2:

So, I managed to figure it out, and I figure I should put the solution down here just in case. It took multiple steps, so if any one can think of a better way to do this please let me know.

First, I created a column to count the number of days between the 2 dates. I needed this so that I knew how many repetitions of each row to make

calc_day_nums <- function(x){   if(as.numeric(as.Date(x["group_start"])) < as.numeric(as.Date(x["group_end"]))){     len <- length(seq(as.Date(x["group_start"]), as.Date(x["group_end"]), by="days"))   } else if (as.numeric(as.Date(x["group_start"])) > as.numeric(as.Date(x["group_end"]))){     len <- length(seq(as.Date(x["group_end"]), as.Date(x["group_start"]), by="days"))   } else {     len <- 1 #basically these are records whose start and end are the same   }   return(len) } df$reps <- apply(df, 1, calc_day_nums) 

Then, I created a vector of all the days themselves

date_vec <- function(i, x, y){   if(as.Date(x[i]) != as.Date(y[i])){     as.Date(as.Date(x[i]):as.Date(y[i]), origin="1970-01-01")   } else{     as.Date(x[i])   } } vec <- lapply(seq_along(df$group_start), date_vec, x=df$group_start, y=df$group_end) vec <- unlist(vec) vec <- as.Date(vec) 

After that, I made the correct number of row repetitions to the data.frame

df <- df[rep(seq_len(nrow(df)), df$reps),] 

Lastly, I bound the vector to the data.frame. At this point I could have also just defined the vec as the xts index xt <- xts(x = df, order.by = vec), but I wanted to add it to the data.frame

df <- bind_cols(df, data.frame(days=vec)) 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!