Data Transformation in R for Panel Regression

萝らか妹 提交于 2019-12-08 07:43:00

问题


I really need your help regarding a problem which may seem easy to solve for you.

Currently I work on a project which involves some panel-regressions. I have several large csv-files (up to 12 million entries per sheet) which are formatted as in the picture attached, whereas the columns (V1, V2) are individuals and the rows (1, 2, 3) are time identifiers.

In order to use the plm()-function I need all these files to convert to the following data structure:

ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...

I really struggle with this transformation and I'm really frustrated right now i.e. where do I get the identifier and the time index from? Would really appreciate if you could provide me with information how to solve this problem.

If my question is not clear to you, just ask.

Best regards and thanks in advance

The output should look like as follows:


回答1:


 mydata<-structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L
    )), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, 
    -4L))

> mydata
  V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23  3
4 13 24 34

The following code can be used for your data without changing anything. For illustration, I used just the above data. I used the base R reshape function

long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
                times = names(mydata), timevar = "id",
                varying = list(names(mydata)),v.names="value", new.row.names = 1:((dim(mydata)[2])*(dim(mydata)[1])),direction = "long")

> long
   id value time
1  V1    10    1
2  V1    11    2
3  V1    12    3
4  V1    13    4
5  V2    21    1
6  V2    22    2
7  V2    23    3
8  V2    24    4
9  V3    31    1
10 V3    32    2
11 V3     3    3
12 V3    34    4
long$id<-substr(long$id,2,4) # 4 is used to take into account your 416 variables
myout<-long[,c(1,3,2)]
> myout
   id time value
1   1    1    10
2   1    2    11
3   1    3    12
4   1    4    13
5   2    1    21
6   2    2    22
7   2    3    23
8   2    4    24
9   3    1    31
10  3    2    32
11  3    3     3
12  3    4    34



回答2:


Here is an alternative: Use Stacked from my "splitstackshape" package.

Here it is applied on @Metrics's sample data:

# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata), 
        id.vars="id", var.stubs="V", sep = "V")
#     id .time_1  V
#  1:  1       1 10
#  2:  1       2 21
#  3:  1       3 31
#  4:  2       1 11
#  5:  2       2 22
#  6:  2       3 32
#  7:  3       1 12
#  8:  3       2 23
#  9:  3       3  3
# 10:  4       1 13
# 11:  4       2 24
# 12:  4       3 34

It would be very fast if your data are large. Here are the speeds for the 12MB dataset you linked to. The sorting is different but the data are the same.

It still isn't faster than stack though (but at some point, stack starts to slow down).

See the system.times below:

reshape()

system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
                           times = names(x), timevar = "id",
                           varying = list(names(x)), 
                           v.names="value",
                           new.row.names = 1:prod(dim(x)), 
                           direction = "long"))
#    user  system elapsed 
#   53.11    0.00   53.11 
head(out)
#   id        value time
# 1 V1  0.003808635    1
# 2 V1 -0.018807416    2
# 3 V1  0.008875447    3
# 4 V1  0.001148695    4
# 5 V1 -0.019365004    5
# 6 V1  0.012436560    6

Stacked()

system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x), 
                            id.vars="id", var.stubs="V", 
                            sep = "V"))
#    user  system elapsed 
#    0.30    0.00    0.29 

out2
#           id .time_1            V
#      1:    1       1  0.003808635
#      2:    1      10 -0.014184635
#      3:    1     100 -0.013341843
#      4:    1     101  0.006784138
#      5:    1     102  0.006463707
#     ---                          
# 963868: 2317      95  0.009569451
# 963869: 2317      96  0.002497771
# 963870: 2317      97  0.009202519
# 963871: 2317      98  0.017007545
# 963872: 2317      99 -0.002495842

stack()

system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
#    user  system elapsed 
#    0.09    0.00    0.09
head(out3)
#   id       values ind
# 1  1  0.003808635  V1
# 2  2 -0.018807416  V1
# 3  3  0.008875447  V1
# 4  4  0.001148695  V1
# 5  5 -0.019365004  V1
# 6  6  0.012436560  V1


来源:https://stackoverflow.com/questions/18541269/data-transformation-in-r-for-panel-regression

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!