Applying non-trivial functions to ordered subsets of data.table

Question

Problem

I'm trying to use my newfound data.table powers (for good) to compute the frequency content of a bunch of data that looks like this:

|  Sample|  Channel|  Trial|     Voltage|Class  |  Subject|
|-------:|--------:|------:|-----------:|:------|--------:|
|       1|        1|      1|  -196.82253|1      |        1|
|       1|        2|      1|   488.15166|1      |        1|
|       1|        3|      1|  -311.92386|1      |        1|
|       1|        4|      1|  -297.06078|1      |        1|
|       1|        5|      1|  -244.95824|1      |        1|
|       1|        6|      1|  -265.96525|1      |        1|
|       1|        7|      1|  -258.93263|1      |        1|
|       1|        8|      1|  -224.07819|1      |        1|
|       1|        9|      1|   -87.06051|1      |        1|
|       1|       10|      1|  -183.72961|1      |        1|

There are about 57 million rows. Every variable is an integer except Voltage. Sample is an index running from 1 to 350, and Channel runs from 1 to 118. There are 280 Trials.

sample data

Martín's example data is valid, I believe (the numbers of categorical variables are a non-issue with respect to the errors):

big.table <- data.table(Sample = 1:350, Channel = 1:118, Trial = letters,
             Voltage = rnorm(10e5, -150, 100), Class = LETTERS, Subject = 1:20)

process

The first thing I do is set the key to Sample, because I want anything I do to the individual data series to happen in a sane order:

setkey(big.table,Sample)
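Keying on Sample sorts the whole table by Sample. Including the grouping columns in the key is one way to also make each series contiguous while keeping it in Sample order. A small sketch with made-up data (the toy table, its sizes and the in.order check are assumptions for illustration, not part of the real data set):

```r
library(data.table)

set.seed(1)
# Hypothetical toy data with rows deliberately shuffled
toy <- CJ(Subject = 1:2, Channel = 1:2, Sample = 1:5)
toy <- toy[sample(.N)]
toy[, Voltage := rnorm(.N)]

# Sort by the grouping columns first, then by Sample within each group
setkey(toy, Subject, Channel, Sample)

# Check that every (Subject, Channel) group now sees Sample as 1, 2, ..., 5
in.order <- toy[, .(ok = identical(Sample, 1:5)), by = .(Subject, Channel)]
print(in.order)
```

Any function applied by group then receives its Voltage values in Sample order.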

Then, I do some filtering on the Voltage signals to remove low frequencies (the filter below is a high-pass filter, since Wp is greater than Ws). The filtering function returns a vector of the same length as its second argument:

require(signal)
high.pass <- cheby1(cheb1ord(Wp = 0.14, Ws = 0.0156, Rp = 0.5, Rs = 10))
big.table[,Voltage:=filtfilt(high.pass,Voltage),by=Subject]

initial error

I'd like to see if that processed it properly (i.e. Subject by Subject, Trial by Trial, Channel by Channel, in Sample order), so I add a column containing the spectral content of the Voltage column:

get.spectrum <- function(x) {
    spec.obj <- spectrum(x,method="ar",plot=FALSE)
    outlist <- list()
    outlist$spec <- 20*log10(spec.obj$spec)
    outlist$freq <- spec.obj$freq
    return(outlist)
  }
big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),by=Subject]

Error: cannot allocate vector of size 6.1 Gb

I think the issue is that get.spectrum() is trying to eat the whole column at once, considering that the whole table is only around 1.7GB. Is that so? What are my options?

What have you tried?

Increasing the granularity of grouping

If I make a call to get.spectrum including all of the columns I want to group by, I get a more promising error:

big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),
        by=c("Subject","Trial","Channel","Sample")]

Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action,  : 
  'order.max' must be >= 1

That implies the spectrum() function I'm calling is getting data of the wrong shape.
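One way to see where that error comes from, as a minimal sketch: assuming every (Subject, Trial, Channel, Sample) combination occurs exactly once, grouping by all four columns hands get.spectrum() a single Voltage value per group, and the Yule-Walker fit behind spectrum(method = "ar") cannot choose a model order from one observation. The value below is just the first row of the table above; the variable names are illustrative.

```r
# A single observation, as one group would contain if Sample is part of `by`
one.point <- -196.82253

# Capture the error message instead of stopping the script
msg <- tryCatch({
  spectrum(one.point, method = "ar", plot = FALSE)
  NA_character_
}, error = function(e) conditionMessage(e))

print(msg)

# A full-length series (350 hypothetical samples) fits without complaint
set.seed(1)
ok <- spectrum(rnorm(350), method = "ar", plot = FALSE)
print(length(ok$spec))
```

If that assumption holds for the real table, leaving Sample out of by is what gives each call a whole series to work with.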

Cutting points down, trying different 'where' conditions

Following Roland's advice, I cut the number of points to around 20 million and tried the below:

big.table[,"Spectrum":=get.spectrum(Voltage),
        by=c("Subject","Trial","Channel")]

Error in `[.data.table`(big.table, , `:=`("Spectrum", get.spectrum(Voltage)),  :
  All items in j=list(...) should be atomic vectors or lists. If you are trying something like
  j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge 
  afterwards.

My thinking was that I shouldn't group by Sample since I want to apply this function to each group of 350 Samples given by the above by vector.

Improving on that with some things gleaned from section 2.16 of the data.table FAQ, I added the SQL equivalent of an ORDER BY. I know that the Sample column needs to go from 1:350 for each input to the spectrum() function:

> big.table[Sample==c(1:350),c("Spectrum","Frequency"):=as.list(get.spectrum(Voltage)),
+             by=c("Subject","Trial","Channel")]
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action,  : 
  'order.max' must be >= 1

Again, I run into trouble with non-unique inputs.

Answer 1:

Perhaps this can start to solve the problem:

I believe data.table raises the error because get.spectrum returns a list with two elements, spec and freq.

Using this example dataset:
big.table <- data.table(Sample = 1:350, Channel = 1:118, Trial = letters,
                 Voltage = rnorm(10e5, -150, 100), Class = LETTERS, Subject = 1:20)

str(big.table)
setkey(big.table,Sample)

get.spectrum <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  outlist <- list()
  outlist$spec <- 20*log10(spec.obj$spec)
  outlist$freq <- spec.obj$freq
  return(outlist)
}

VT <- get.spectrum(big.table$Voltage)
str(VT)

# Then you should decide which value you would like to insert in big.table
get.spectrum(big.table$Voltage)$spec
# or
get.spectrum(big.table$Voltage)$freq

This should work. You can also use set()

big.table[, Spectrum:= get.spectrum(Voltage)$spec, by=Subject]
big.table[, Frequency:= get.spectrum(Voltage)$freq, by=Subject]

EDIT: As mentioned in the comments, I've tried to provide an answer using set(), but I don't see how to "group by" Subject. Here is what I've tried; I'm not sure it's the intended answer.

cols = c("spec", "freq")
for(inx in cols){
  # j must be the loop variable (the original j=j referred to an undefined object)
  set(big.table, i=NULL, j=inx, value = get.spectrum(big.table[["Voltage"]])[[inx]])
}

EDIT2: Two functions, one for each column, using a different combination of grouping variables.

spec_fun <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  spec <- 20*log10(spec.obj$spec)
  spec
}

freq_fun <- function(x) {
  freq <- spectrum(x,method="ar",plot=FALSE)$freq
  freq
}

big.table[, Spectrum:= spec_fun(Voltage), by=c("Subject","Trial","Channel")]
big.table[, Frequency:= freq_fun(Voltage), by=c("Subject","Trial","Channel")]

# It gives some warnings(), probably because of the made-up data.

Answer 2:

After some extended discussion with Martín Bel who was patient enough to listen to me thrash, I was able to work out some of what was going wrong.

initial error

A major issue is that spectrum(), the function being called on each time-series component of the data.table, expects a 2D structure representing a multivariate time series (in this case, channels x samples). So this call

big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),by=Subject]

Error: cannot allocate vector of size 6.1 Gb

is totally bad.

brute 'for'ce

Here is a slow way to do it using (mostly useless) parallelization. get.spectrum() is modified to return a simple vector, which addresses the third error above (the one about return types in j):

get.spectrum <- function(x) {
    spec.obj <- spectrum(x,method="ar",plot=FALSE)
    outlist <- list()
    outlist <- 20*log10(spec.obj$spec)
    # outlist$freq <- spec.obj$freq # don't return me
    return(outlist)
}

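require(parallel)
require(foreach)
# Note: %dopar% runs sequentially (with a warning) unless a parallel backend
# has been registered first.
# sampling.rate and c.ind are not defined in this snippet and must exist beforehand.
freq.bins <- 500
# Nested loops are chained with %:% directly; no braces around the inner foreach
spectra <- foreach(s.ind = unique(big.table$Subject), .combine=rbind) %:%
              foreach(t.ind = unique(big.table$Trial), .combine=rbind) %dopar% {

                cbind((sampling.rate * (seq_len(freq.bins)-1) / sampling.rate),
                  rep(c.ind,freq.bins),
                  rep(t.ind,freq.bins),
                  get.spectrum((subset(big.table, 
                   subset=(Subject==s.ind & 
                             Trial==t.ind),
                   select=Voltage))$Voltage),
                  rep(s.ind,freq.bins))

              }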

This gives the right result because each input to get.spectrum() is a subset where Subject and Trial are fixed, leaving Channel and Sample to vary. However, it is quite slow, and spends over 80% of the computational load in 1 of the 4 cores I have on this machine.

data.table approach

I went back to some toy cases that came up in the discussion, and tried this again:

spec.dt <- big.table[,get.spectrum(Voltage),by=c("Subject","Trial")]

This is close! It returns a data.table of almost the right structure.

> str(spec.dt)
Classes ‘data.table’ and 'data.frame':  140000 obs. of  3 variables:
 $ Subject: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Trial  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V1     : num  110.7 109 105.4 101.6 98.2 ...

However, the Channel variable is missing. Easily fixed:

> spec.dt <- erp.table[,get.spectrum(Voltage),by=c("Subject","Trial","Channel")]
> str(spec.dt)
Classes ‘data.table’ and 'data.frame':  16520000 obs. of  4 variables:
 $ Subject: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Trial  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Channel: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V1     : num  78.6 78.6 78.6 78.5 78.5 ...
 - attr(*, ".internal.selfref")=<externalptr>

Is this right? Well, it's easy to check if it's the right shape. We know that there are 500 frequency bins in the default spectrum() call, and I stated that the data had 118 channels.

> nrow(spec.dt)
[1] 16520000
> nrow(spec.dt)/500
[1] 33040
> nrow(spec.dt)/500/118
[1] 280

I didn't mention it in the original question, but there are indeed 280 trials.
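The same shape check can be tried on a toy table before committing to the full data. This is only a sketch: the table dimensions, get.spectrum2() and its column names are made up for illustration, and it assumes one row per (Subject, Trial, Channel, Sample). Returning a named list from j also keeps the frequency axis next to each spectrum value.

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: 2 subjects x 3 trials x 4 channels x 50 samples
toy <- CJ(Subject = 1:2, Trial = 1:3, Channel = 1:4, Sample = 1:50)
toy[, Voltage := rnorm(.N, -150, 100)]

# Illustrative variant of get.spectrum() that returns both columns
get.spectrum2 <- function(x) {
  spec.obj <- spectrum(x, method = "ar", plot = FALSE)
  list(Frequency = spec.obj$freq,
       Spectrum  = 20 * log10(as.vector(spec.obj$spec)))
}

# Sample is left out of `by`, so each group is one complete series
spec.toy <- toy[, get.spectrum2(Voltage), by = .(Subject, Trial, Channel)]

# Number of distinct groups, and output rows produced per group
n.groups <- uniqueN(toy, by = c("Subject", "Trial", "Channel"))
print(nrow(spec.toy) / n.groups)
```

The rows-per-group figure should match the number of frequency bins, independent of how many samples each series has.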

remark

An apparent rule here is that in the by argument, you need to leave out the independent variable corresponding to the dependent data. If you don't, the other error appears.

> spectra.table <- big.table[,get.spectrum(Voltage),by=c("Sample","Subject","Channel")]
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action,  : 
  'order.max' must be >= 1

Here Voltage is a function of Sample (since Sample is an index); it is repeated over and over again for each Channel and each Subject.

I don't know exactly what the problem is here, though.

benchmarks

> system.time(spec.dt <- erp.table[,get.spectrum(Voltage),by=c("Subject","Trial","Channel")])
   user  system elapsed 
 86.669   3.452  87.414

system.time(
  spectra <- foreach(s.ind = unique(erp.table$Subject), .combine=rbind) %:% 
              foreach(t.ind = unique(erp.table$Trial), .combine=rbind) %dopar% {

                cbind((sampling.rate * (seq_len(freq.bins)-1) / sampling.rate),
                  rep(c.ind,freq.bins),
                  rep(t.ind,freq.bins),
                  get.spectrum((subset(erp.table, 
                   subset=(Subject==s.ind & 
                             Trial==t.ind),
                   select=Voltage))$Voltage),
                  rep(s.ind,freq.bins))

              })
   user  system elapsed 
114.259  17.937 131.873

The second benchmark is optimistic; I had run it a second time without cleaning up the environment or removing variables.

Source: https://stackoverflow.com/questions/21156801/applying-non-trivial-functions-to-ordered-subsets-of-data-table

Tags

data-structures

out-of-memory

data.table