Reshaping from long to wide with some missing data (NAs) on time-invariant variables


When using stats::reshape() from base R to convert data from long to wide format, for any variables designated as time invariant, reshape just takes

2 Answers

刺人心

2021-01-16 18:33

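I don't know how to fix the underlying problem, but one way to treat the symptom is to sort the data so that the NA values come last. reshape() keeps the time-invariant value from the first row it finds for each id, so a non-missing value is picked up whenever one exists.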

testdata <- testdata[order(testdata$timeinvariant),]
testdata
#  id process1 timeinvariant time
#3  2      3.5             4    1
#2  1      4.0             6    2
#1  1      3.0            NA    1
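# idvar and timevar are spelled out here; "id" and "time" are also reshape()'s defaults
reshaped <- reshape(testdata, idvar = "id", timevar = "time", v.names = "process1", direction = "wide")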
reshaped
#  id timeinvariant process1.1 process1.2
#3  2             4        3.5         NA
#2  1             6        3.0          4
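
For a self-contained version of the sorting workaround, here is a minimal sketch. The testdata frame is reconstructed from the printed output above, so treat its contents as an assumption; the names naive, sorted and fixed are only for illustration.

```r
# testdata reconstructed from the printed output above (an assumption)
testdata <- data.frame(
  id            = c(1, 1, 2),
  process1      = c(3.0, 4.0, 3.5),
  timeinvariant = c(NA, 6, 4),
  time          = c(1, 2, 1)
)

# Without sorting, id 1 keeps the NA from its first row.
# reshape() may warn that timeinvariant is not constant within id,
# so the warning is suppressed here.
naive <- suppressWarnings(
  reshape(testdata, idvar = "id", timevar = "time",
          v.names = "process1", direction = "wide")
)

# order() puts NA last by default, so non-missing values come first
sorted <- testdata[order(testdata$timeinvariant), ]
fixed <- suppressWarnings(
  reshape(sorted, idvar = "id", timevar = "time",
          v.names = "process1", direction = "wide")
)

naive$timeinvariant[naive$id == 1]  # NA
fixed$timeinvariant[fixed$id == 1]  # 6
```

Note that sorting only helps one column at a time: with several time-invariant columns, no single row order is guaranteed to put a non-missing value first in all of them.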

A more general solution is to make sure each id has exactly one value in the timeinvariant column:

# Within each id, replace every value with the group maximum, ignoring NAs
testdata$timeinvariant <- ave(testdata$timeinvariant, testdata$id, FUN = function(x) max(x, na.rm = TRUE))
testdata
#  id process1 timeinvariant time
#3  2      3.5             4    1
#2  1      4.0             6    2
#1  1      3.0             6    1

This can be repeated for any number of columns before calling the reshape function. Hope this helps.
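
Repeating the fill for several columns could be wrapped in a small helper, roughly as sketched below. fill_invariant is a hypothetical function (not part of base R), and the long data frame with its site and cohort columns is made up for illustration. It uses the first non-missing value per id instead of max(), which also works for non-numeric columns.

```r
# Hypothetical helper: within each id, replace NA in the given columns
# with the first non-missing value found for that id.
fill_invariant <- function(data, idvar, cols) {
  for (col in cols) {
    data[[col]] <- ave(data[[col]], data[[idvar]], FUN = function(x) {
      known <- x[!is.na(x)]
      # Leave the group untouched if it has no known value at all
      if (length(known) == 0) x else rep(known[1], length(x))
    })
  }
  data
}

# Made-up example with two time-invariant columns
long <- data.frame(
  id       = c(1, 1, 2, 2),
  time     = c(1, 2, 1, 2),
  process1 = c(3.0, 4.0, 3.5, 5.0),
  site     = c(NA, 6, 4, NA),
  cohort   = c(10, NA, NA, 20)
)

filled <- fill_invariant(long, idvar = "id", cols = c("site", "cohort"))
wide <- reshape(filled, idvar = "id", timevar = "time",
                v.names = "process1", direction = "wide")
wide
```

Because every id now carries a single value per column, the result no longer depends on row order.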


忘掉有多难

2021-01-16 18:47

If there's always at least one non-missing value of timeinvariant for each id, and all (non-missing) values of timeinvariant are identical for each id (since it's time-invariant), couldn't you create a new column that fills in the NA values in timeinvariant and then reshape using that column? For example:

# Add another row to your data frame so that we'll have 2 NA values to deal with
td <- data.frame(matrix(c(1,1,2,1,3,4,3.5,4.5,NA,6,4,NA,1,2,1,3), nrow = 4))
colnames(td) <- c("id", "process1", "timeinvariant", "time")

# Create new column timeinvariant2, which fills in NAs from timeinvariant,
# then reshape using that column
library(dplyr)
library(reshape2)  # provides dcast()
td.wide <- td %>%
  group_by(id) %>%
  mutate(timeinvariant2 = max(timeinvariant, na.rm = TRUE)) %>%
  dcast(id + timeinvariant2 ~ time, value.var = "process1")

# Prefix the names of all "time" columns with "process1."
names(td.wide) <- gsub("^([0-9]+)$", "process1.\\1", names(td.wide))

td.wide

  id timeinvariant2 process1.1 process1.2 process1.3
1  1              6        3.0          4        4.5
2  2              4        3.5         NA         NA
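
This approach rests on the assumption that each id has exactly one distinct non-missing value of timeinvariant. A quick check before filling might look like the sketch below; known_per_id is just an illustrative name, and td is rebuilt here with the same values as above so the snippet runs on its own.

```r
# Same values as the td data frame defined in this answer
td <- data.frame(
  id            = c(1, 1, 2, 1),
  process1      = c(3, 4, 3.5, 4.5),
  timeinvariant = c(NA, 6, 4, NA),
  time          = c(1, 2, 1, 3)
)

# Number of distinct non-missing values of timeinvariant for each id
known_per_id <- tapply(td$timeinvariant, td$id,
                       function(x) length(unique(x[!is.na(x)])))

# Stop early if any id has no known value or conflicting values
stopifnot(all(known_per_id == 1))
known_per_id
```

If the check fails, max() would silently pick one of the conflicting values, or return -Inf with a warning for an id that has only NAs.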
