Reshaping from long to wide with some missing data (NA's) on time invariant variables

死守一世寂寞 2021-01-16 18:13

When using stats::reshape() from base R to convert data from long to wide format, for any variables designated as time invariant, reshape just takes

2 Answers
  • 2021-01-16 18:33

    I don't know how to fix the underlying problem, but one way to treat the symptom is to sort the rows so that the NA values come last. reshape() then picks up a non-missing value first for each id.

    testdata <- testdata[order(testdata$timeinvariant),]
    testdata
    #  id process1 timeinvariant time
    #3  2      3.5             4    1
    #2  1      4.0             6    2
    #1  1      3.0            NA    1
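    # idvar and timevar are spelled out here; "id" and "time" are also reshape()'s defaults
    reshaped <- reshape(testdata, v.names = "process1", idvar = "id", timevar = "time", direction = "wide")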
    reshaped
    #  id timeinvariant process1.1 process1.2
    #3  2             4        3.5         NA
    #2  1             6        3.0          4
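
    The answer never shows how testdata was built. The sketch below reconstructs it from the printed output above (an assumption) and runs the sort-then-reshape workaround end to end. The name `sorted` is introduced only for illustration.

    ```r
    # testdata is reconstructed from the printed output above (an assumption)
    testdata <- data.frame(
      id            = c(1, 1, 2),
      process1      = c(3, 4, 3.5),
      timeinvariant = c(NA, 6, 4),
      time          = c(1, 2, 1)
    )

    # order() places NA last by default, so each id's non-missing value comes first
    sorted <- testdata[order(testdata$timeinvariant), ]

    # reshape() keeps the first row it sees per id for the non-varying columns;
    # it may still warn that timeinvariant is not really constant within id 1
    reshaped <- reshape(sorted, v.names = "process1", idvar = "id",
                        timevar = "time", direction = "wide")
    reshaped
    ```

    Sorting only helps for one column at a time: with several time-invariant columns whose NAs fall on different rows, no single row order puts every non-missing value first.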
    

    A more general solution is to make sure there is only one value in the timeinvariant column per id:

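    # For each row, look up all rows with the same id (x[1] is the id column)
    # and take the largest non-missing timeinvariant value
    testdata$timeinvariant <- apply(testdata, 1, function(x) max(testdata[testdata$id == x[1], "timeinvariant"], na.rm = TRUE))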
    testdata
    #  id process1 timeinvariant time
    #3  2      3.5             4    1
    #2  1      4.0             6    2
    #1  1      3.0             6    1
    

    This can be repeated for any number of time-invariant columns before calling reshape(). Hope this helps.
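
    As a rough sketch of repeating the fill over several columns, the helper below uses ave() to fill each column within each id. `fill_invariant` is a hypothetical name, not part of base R or of the answer, and it takes the first non-missing value per id rather than max(), so it also works for non-numeric columns. The toy data is again reconstructed from the printed output in this answer (an assumption).

    ```r
    # Hypothetical helper (not from the answer): fill NAs in the given columns
    # with the first non-missing value observed for each id.
    fill_invariant <- function(df, cols, id = "id") {
      for (col in cols) {
        df[[col]] <- ave(df[[col]], df[[id]], FUN = function(x) {
          known <- x[!is.na(x)]
          # leave the group untouched if it has no non-missing value at all
          if (length(known) == 0) x else rep(known[1], length(x))
        })
      }
      df
    }

    # Toy data reconstructed from the printed output in this answer (an assumption)
    testdata <- data.frame(id = c(1, 1, 2), process1 = c(3, 4, 3.5),
                           timeinvariant = c(NA, 6, 4), time = c(1, 2, 1))

    filled <- fill_invariant(testdata, cols = "timeinvariant")
    filled
    ```

    Passing a vector such as `cols = c("a", "b")` would fill each listed column in turn. If the values for an id are not actually identical, this quietly keeps the first one, so it is only safe when the column really is time invariant.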

  • 2021-01-16 18:47

    If there's always at least one non-missing value of timeinvariant for each id, and all (non-missing) values of timeinvariant are identical for each id (since it's time-invariant), couldn't you create a new column that fills in the NA values in timeinvariant and then reshape using that column? For example:

    # Add another row to your data frame so that we'll have 2 NA values to deal with
    td <- data.frame(matrix(c(1,1,2,1,3,4,3.5,4.5,NA,6,4,NA,1,2,1,3), nrow = 4))
    colnames(td) <- c("id", "process1", "timeinvariant", "time")
    
    # Create new column timeinvariant2, which fills in NAs from timeinvariant,
    # then reshape using that column
    library(dplyr)
    library(reshape2)  # provides dcast()
    td.wide = td %>%
      group_by(id) %>%
      mutate(timeinvariant2=max(timeinvariant, na.rm=TRUE)) %>%
      dcast(id + timeinvariant2 ~ time, value.var='process1')
    
    # Paste "process1." onto names of all "time" columns
    names(td.wide) = gsub("^([0-9]+)$", "process1.\\1", names(td.wide))
    
    td.wide
    #  id timeinvariant2 process1.1 process1.2 process1.3
    #1  1              6        3.0          4        4.5
    #2  2              4        3.5         NA         NA
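
    The approach rests on two assumptions stated at the top of this answer: every id has at least one non-missing timeinvariant value, and those values agree within an id. A minimal base R check of both, run before filling, might look like this. The variable names are introduced only for illustration.

    ```r
    # td as defined in this answer
    td <- data.frame(matrix(c(1,1,2,1,3,4,3.5,4.5,NA,6,4,NA,1,2,1,3), nrow = 4))
    colnames(td) <- c("id", "process1", "timeinvariant", "time")

    # Number of distinct non-missing timeinvariant values per id
    n_distinct_vals <- tapply(td$timeinvariant, td$id,
                              function(x) length(unique(x[!is.na(x)])))

    # ids with no value at all: max(..., na.rm = TRUE) would return -Inf with a warning
    ids_all_missing <- names(n_distinct_vals)[n_distinct_vals == 0]
    # ids whose values conflict: max() would silently pick the largest
    ids_conflicting <- names(n_distinct_vals)[n_distinct_vals > 1]

    # Stop early if either assumption is violated
    stopifnot(length(ids_all_missing) == 0, length(ids_conflicting) == 0)
    ```

    If the check fails, the two vectors list which ids need attention before reshaping.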
    