Split up an xts data frame into several groups, collapse to weekly data and keep the time index

Question

I am a total newbie to R, so I apologize if the answer to my question is too obvious. I have a data set of the following form:

Date, V1,V...,VN,Land,Nace
22/03/1995 23:01:12,1,3,2,15,A
21/03/1995 21:01:12,3,3,1,9,C
1/04/1995 17:01:06,3,2,1,3,B

I would like to analyze the data in the data.frame by Land, NACE (an industry code), Date (I would like to collapse the whole thing to weekly data) and by the three different answer options {1,2,3} in V1...VN. This is a sample of my data:

example <- as.data.frame(structure(c(" 1", " 2", " 1", " 2", " 1", " 1", " 2", " 1", " 2", 
" 1", " 2", " 3", " 1", " 1", " 2", " 2", " 3", " 1", " 2", " 2", 
" 1", " 2", " 1", " 1", " 2", NA, " 2", NA, NA, " 1", " 3", " 1", 
" 3", " 3", " 2", " 3", " 3", " 3", " 2", " 2", " 2", " 3", " 3", 
" 3", " 2", " 2", " 3", " 3", " 3", " 3", " 1", " 2", " 1", " 2", 
" 2", " 1", " 2", " 1", " 2", " 2", " 2", " 3", " 1", " 1", " 2", 
" 2", " 3", " 3", " 2", " 2", " 1", " 2", " 1", " 1", " 2", NA, 
" 2", NA, NA, " 1", " 3", " 2", " 3", " 2", " 0", " 3", " 3", 
" 3", " 2", " 0", " 2", " 3", " 3", " 3", " 0", " 2", " 2", " 3", 
" 3", " 0", "12", " 5", " 9", "14", " 5", "tra", "tra", "man", 
"inf", "agc", "07-2011", "07-2011", "07-2011", "07-2011", "07-2011" 
), .indexCLASS = c("POSIXlt", "POSIXt"), .indexTZ = "", class = c("xts", 
"zoo"), .indexFORMAT = "%U-%Y", index = structure(c(1297642226, 
1297672737, 1297741204, 1297748893, 1297749513), tzone = "", tclass = c("POSIXlt", 
"POSIXt")), .Dim = c(5L, 23L), .Dimnames = list(NULL, c("rev_sit", 
"prof_sit", "emp_nr_sit", "inv_sit", "ord_home_sit", "ord_abr_sit", 
"emp_cost_sit", "usage_cost_sit", "tax_cost_sit", "gov_cost_sit", 
"rev_exp", "prof_exp", "emp_nr_exp", "inv_exp", "ord_home_exp", 
"ord_abr_exp", "emp_cost_exp", "usage_cost_exp", "tax_cost_exp", 
"gov_cost_exp", "land", "nace", "index"))))

prof_sit etc. are the questions, and the values below them are the answers on the scale 1, 2, 3. land, nace and index (the time index) are the variables by which I would like to split up the data set. The goal is to get an xts data frame which would look like this:

-,nace.land,nace.land,nace.land,...
10-1995,sum of answers coded e.g. as 1 for a certain nace and a certain land,sum,sum,...
11-1995,sum,sum,...
12-1995,sum,sum,...

where 12-1995 is the 12th calendar week in 1995. The nearest I came to this solution was with tapply:

pos <- as.data.frame(tapply((example[,1]==3)*1,
  list(example$index, example$land, example$nace), sum))

It does more or less what I want, with the downside that the xts format is lost and the rows are therefore no longer in the right order. A second disadvantage is that I will have to run loops to apply the same technique to all twenty questions. Does anybody know a solution to this problem? I appreciate any help or hint, since I have been stuck on this problem for several days now.

best regards,

andreas

Answer 1:

First, remember that an xts object is a matrix underneath, not a data frame. That means all numeric, or all character. In your case you should convert your "Nace" code to a number.
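As a rough illustration of that point, the sketch below uses plain base R matrices (which is what xts stores underneath). The values are made up and only loosely modelled on the land and nace columns in the question; recoding through a factor is one possible way to get a numeric code, not the only one.

```r
# Hypothetical toy values, loosely modelled on the question's land/nace columns.
nace <- c("tra", "tra", "man", "inf", "agc")
land <- c(12, 5, 9, 14, 5)

# Mixing numbers and strings in one matrix coerces everything to character.
mixed <- cbind(land = land, nace = nace)
is.character(mixed)

# Recode the industry code as integers via a factor; keep the factor
# so the integers can be mapped back to the labels later.
nace_f <- factor(nace)
nace_code <- as.integer(nace_f)

# With only numbers, the matrix stays numeric.
numeric_only <- cbind(land = land, nace = nace_code)
is.numeric(numeric_only)

# Map the integer codes back to the original labels.
levels(nace_f)[nace_code]
```

Keeping nace_f around matters because the integer codes on their own carry no meaning once the labels are gone.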

Here is some abstract test data:

library(xts)
# 100 daily observations: a running value and a code cycling through 1..5
x <- xts(data.frame(value = 1:100, code = rep(1:5, 20)),
         seq(as.Date('2011-01-01'), by = 1, length.out = 100))

A simplistic function is not what you want:

apply.weekly(x,sum)
#           [,1]
#2011-01-03   12
#2011-01-10   73
#...

Both columns have been summed together! So, when dealing with multiple columns you want:

apply.weekly(x,function(d){c(sum(d$value),sum(d$code))})
#           value code
#2011-01-03     6    6
#2011-01-10    49   24
#2011-01-17    98   18
#...

You can get as complex as you want in the function, and can introduce new columns, e.g.

apply.weekly(x,function(d){
  c(sum_value = sum(d$value),
   mean_value = mean(d$value),
        A_cnt = sum(d$code==1),
        B_cnt = sum(d$code==2)
   )
  })
#           sum_value mean_value A_cnt B_cnt
#2011-01-03         6        2.0     1     1
#2011-01-10        49        7.0     1     1
#2011-01-17        98       14.0     2     2
#...

What about when you want to do further grouping? This example shows how to sum values, grouped by week then grouped by type:

apply.weekly(x,function(d){
  t=tapply(d$value,d$code,sum)
  c(A=t[1],B=t[2],C=t[3],D=t[4],E=t[5])
  })
#           A.1 B.2 C.3 D.NA E.NA
#2011-01-03   1   2   3   NA   NA
#2011-01-10   6   7   8   13   15
#2011-01-17  27  29  13   14   15
#...
#2011-04-10  96  97  98   99  195

Excuse: I've taken this chance to label the columns. Real reason: I cannot work out how to return the return value of tapply in a way that XTS's apply.weekly() likes. I've tried simplify=F, and unlist(t), as.vector(t), as.matrix(t), each with/without a c() wrapper. If/when I work this out, I'll update this answer.
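One possible way around this, offered as a sketch rather than a confirmed fix: the weekly results differ in length whenever a code is missing from a week, so forcing a fixed set of factor levels makes tapply return one slot per code every time. The levels 1:5 and the code_ column prefix are assumptions that match the test data above.

```r
library(xts)

# Same shape of test data as above, built from a numeric matrix.
x <- xts(cbind(value = 1:100, code = rep(1:5, 20)),
         seq(as.Date("2011-01-01"), by = 1, length.out = 100))

res <- apply.weekly(x, function(d) {
  # Fixed levels (assumed to be 1..5) keep the result length constant,
  # even for weeks in which some codes do not occur.
  codes <- factor(as.numeric(coredata(d[, "code"])), levels = 1:5)
  s <- tapply(as.numeric(coredata(d[, "value"])), codes, sum)
  # Return a plain named vector; codes absent in a week come back as NA.
  setNames(as.vector(s), paste0("code_", levels(codes)))
})

head(res)
```

Each row is one week and each column one code, so the manual c(A=t[1], ...) labelling is no longer needed.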

To get a feel for what you can do in an apply.weekly() callback, you can use print commands.

apply.weekly(x,function(d){
  print(d);str(d)
  })

It complains because we're not returning something useful, but it is fine for interactive development. d is an xts object. You can turn it into a data.frame with df=as.data.frame(d).

Answer 2:

Thank you very much for all your help. I was busy with other things in the meantime, but I have now worked on my problem again, and with the help of your great comments I have found a solution:

I gave up working directly with time series, postponing this step to the end of my analysis. Therefore I take the date vector and transform it into weeks:

library(ISOweek)

d$index <- ISOweek(d$date)

(I do this with ISOweek since I am using Windows.)

Then I use a combination of tapply and lapply. The following function counts the positive answers in the survey (coded as 1) for every calendar week (d$index = t[[23]]) and every combination of the two categorical columns land and nace (t[[21]], t[[22]]). In the same step the whole thing is transformed into a data frame:

groupweeksums <- function(x, t) {
  # count answers equal to 1 per week (t[[23]]), land (t[[21]]) and nace (t[[22]])
  as.data.frame(tapply((x == 1) * 1, list(t[[23]], t[[21]], t[[22]]),
                       function(d) sum(d, na.rm = TRUE)))
}

x stands for the specific question column and t for the data frame. (I didn't know how to do this otherwise, because at one point I have to address a single column and at another the whole data frame, and I wanted to avoid lots of typing.)

If d is the data frame, then for a single question column (here the first one):

df <- groupweeksums(d[[1]], d)

So that I don't have to repeat this procedure for all 20 of my questions, I use lapply:

df <- as.data.frame(lapply(d[, 1:20], function(col) groupweeksums(col, d)))

This gives me a beautiful data frame with all I need for further analysis. Thanks for your help, with your helpful comments I came closer and closer to the solution!!!
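The same idea can be sketched on a small, self-contained example. Everything here is made up for illustration: the data frame, the column names q1 and q2, and the helper count_ones are hypothetical. The week label uses base R's format() with %U instead of ISOweek, so the week numbering convention differs from ISO weeks.

```r
# Toy survey data; all names and values are made up for illustration.
d <- data.frame(
  q1   = c(1, 2, 1, 3, 1, NA),
  q2   = c(1, 1, 2, 1, 3, 1),
  land = c(12, 12, 5, 5, 12, 12),
  nace = c("tra", "tra", "man", "man", "tra", "tra"),
  date = as.Date(c("2011-02-14", "2011-02-15", "2011-02-15",
                   "2011-02-22", "2011-02-23", "2011-02-23"))
)

# Year-week label (year first, so the labels sort chronologically).
d$week <- format(d$date, "%Y-%U")

# One grouping by week and one by the nace.land combination.
groups <- list(week = d$week,
               cell = interaction(d$nace, d$land, drop = TRUE))

# Hypothetical helper: count answers equal to 1 within each group.
count_ones <- function(x, groups) {
  tapply(x == 1, groups, sum, na.rm = TRUE)
}

# Apply the helper to every question column at once.
res <- lapply(d[c("q1", "q2")], count_ones, groups = groups)
res$q1
```

Each element of res is a matrix with one row per week and one column per nace.land combination, which is close to the layout asked for in the question. The rows come out in week order because the labels put the year first.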

P.S. I will also post this answer to the other question I posted on stackoverflow which was connected to this first one.

Source: https://stackoverflow.com/questions/9002995/split-up-xts-dataframe-into-several-groups-collapse-to-weekly-data-and-keep-the

Tags

time-series

xts