Question
Problem
I'm trying to use my newfound data.table powers (for good) to compute the frequency content of a bunch of data that looks like this:
| Sample| Channel| Trial| Voltage|Class | Subject|
|-------:|--------:|------:|-----------:|:------|--------:|
| 1| 1| 1| -196.82253|1 | 1|
| 1| 2| 1| 488.15166|1 | 1|
| 1| 3| 1| -311.92386|1 | 1|
| 1| 4| 1| -297.06078|1 | 1|
| 1| 5| 1| -244.95824|1 | 1|
| 1| 6| 1| -265.96525|1 | 1|
| 1| 7| 1| -258.93263|1 | 1|
| 1| 8| 1| -224.07819|1 | 1|
| 1| 9| 1| -87.06051|1 | 1|
| 1| 10| 1| -183.72961|1 | 1|
There are about 57 million rows--every variable is an integer except Voltage. Sample is an index that goes from 1:350, and Channel goes from 1:118. There are 280 Trials.
sample data
Martín's example data is valid, I believe (the numbers of categorical variables are a non-issue with respect to the errors):
big.table <- data.table(Sample = 1:350, Channel = 1:118, Trial = letters,
                        Voltage = rnorm(10e5, -150, 100), Class = LETTERS, Subject = 1:20)
process
The first thing I do is set the key to Sample, because I want anything I do to the individual data series to happen in a sane order:
setkey(big.table,Sample)
Then, I do some high-pass filtering on the Voltage signals to remove low frequencies. (The filtering function returns a vector of the same length as its second argument):
require(signal)
high.pass <- cheby1(cheb1ord(Wp = 0.14, Ws = 0.0156, Rp = 0.5, Rs = 10))
big.table[,Voltage:=filtfilt(high.pass,Voltage),by=Subject]
initial error
I'd like to see if that processed it properly (i.e. Subject by Subject, Trial by Trial, Channel by Channel, in Sample order), so I add a column containing the spectral content of the Voltage column:
get.spectrum <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  outlist <- list()
  outlist$spec <- 20*log10(spec.obj$spec)
  outlist$freq <- spec.obj$freq
  return(outlist)
}
big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),by=Subject]
Error: cannot allocate vector of size 6.1 Gb
I think the issue is that get.spectrum() is trying to eat the whole column at once, considering that the whole table is only around 1.7GB. Is that so? What are my options?
What have you tried?
Increasing the granularity of grouping
If I make a call to get.spectrum including all of the columns I want to group by, I get a more promising error:
big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),
          by=c("Subject","Trial","Channel","Sample")]
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action, :
'order.max' must be >= 1
That implies the spectrum() function I'm calling is getting data of the wrong shape.
Cutting points down, trying different 'where' conditions
Following Roland's advice, I cut the number of points to around 20 million and tried the below:
big.table[,"Spectrum":=get.spectrum(Voltage),
          by=c("Subject","Trial","Channel")]
Error in `[.data.table`(big.table, , `:=`("Spectrum", get.spectrum(Voltage)), :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like
j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge
afterwards.
My thinking was that I shouldn't group by Sample since I want to apply this function to each group of 350 Samples given by the above by vector.
Improving on that with some things gleaned from section 2.16 of the data.table FAQ, I added the SQL equivalent of an ORDER BY. I know that the Sample column needs to go from 1:350 for each input to the spectrum() function:
> big.table[Sample==c(1:350),c("Spectrum","Frequency"):=as.list(get.spectrum(Voltage)),
+ by=c("Subject","Trial","Channel")]
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action, :
'order.max' must be >= 1
Again, I run into trouble with non-unique inputs.
Answer 1:
Perhaps this can start to solve the problem:
I believe the error data.table gives is because get.spectrum returns a list with two elements: spec and freq.
Using this example dataset:
big.table <- data.table(Sample = 1:350, Channel = 1:118, Trial = letters,
                        Voltage = rnorm(10e5, -150, 100), Class = LETTERS, Subject = 1:20)
str(big.table)
setkey(big.table,Sample)
get.spectrum <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  outlist <- list()
  outlist$spec <- 20*log10(spec.obj$spec)
  outlist$freq <- spec.obj$freq
  return(outlist)
}
VT <- get.spectrum(big.table$Voltage)
str(VT)
# Then you should decide which value you would like to insert in big.table
get.spectrum(big.table$Voltage)$spec
# or
get.spectrum(big.table$Voltage)$freq
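This should work as long as the value returned for each group has length 1 or the group's number of rows (recent data.table versions refuse to recycle other lengths in :=). You can also use set()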
big.table[, Spectrum:= get.spectrum(Voltage)$spec, by=Subject]
big.table[, Frequency:= get.spectrum(Voltage)$freq, by=Subject]
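EDIT As mentioned in the comments, I've tried to provide an answer using set(), but I don't see how to "group by" Subject. Here is what I've tried; I'm not sure it's the intended answer.
cols = c("spec", "freq")
for(inx in cols){
  # set() has no by argument, so get.spectrum() sees the whole Voltage column;
  # j names the column to assign and [[inx]] extracts the matching vector
  set(big.table, i=NULL, j=inx, value = get.spectrum(big.table[["Voltage"]])[[inx]])
}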
EDIT2 Two functions, one for each column, using a different combination of group-by variables.
spec_fun <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  spec <- 20*log10(spec.obj$spec)
  spec
}
freq_fun <- function(x) {
  freq <- spectrum(x,method="ar",plot=FALSE)$freq
  freq
}
big.table[, Spectrum:= spec_fun(Voltage), by=c("Subject","Trial","Channel")]
big.table[, Frequency:= freq_fun(Voltage), by=c("Subject","Trial","Channel")]
# It gives some warnings(), probably because of the made-up data.
Answer 2:
After some extended discussion with Martín Bel who was patient enough to listen to me thrash, I was able to work out some of what was going wrong.
initial error
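A major issue is that spectrum(), the function being called on each time-series component of the data.table, expects one time series per call: a plain vector for a single series, or a 2D structure for a multivariate one (in this case, channels x samples). So this call, which hands it every Voltage value for a Subject as one long vector,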
big.table[,c("Spectrum","Frequency"):=get.spectrum(Voltage),by=Subject]
Error: cannot allocate vector of size 6.1 Gb
is totally bad.
brute 'for'ce
Here is a slow way to do it using (mostly useless) parallelization. get.spectrum() is modified to return a simple vector, which addresses the third error above about return types from j:
get.spectrum <- function(x) {
  spec.obj <- spectrum(x,method="ar",plot=FALSE)
  # spectrum in dB, returned as a plain vector rather than a list
  outlist <- 20*log10(spec.obj$spec)
  # outlist$freq <- spec.obj$freq # don't return me
  return(outlist)
}
require(parallel)
require(foreach)
freq.bins <- 500
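# sampling.rate and c.ind are not defined in this snippet; they are assumed
# to exist in the calling environment.
# %dopar% only runs in parallel once a backend has been registered
# (e.g. doParallel::registerDoParallel()); otherwise it runs sequentially.
spectra <- foreach(s.ind = unique(big.table$Subject), .combine=rbind) %:%
  foreach(t.ind = unique(big.table$Trial), .combine=rbind) %dopar% {
    cbind((sampling.rate * (seq_len(freq.bins)-1) / sampling.rate),
          rep(c.ind,freq.bins),
          rep(t.ind,freq.bins),
          get.spectrum((subset(big.table,
                               subset=(Subject==s.ind &
                                       Trial==t.ind),
                               select=Voltage))$Voltage),
          rep(s.ind,freq.bins))
  }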
This gives the right result because each input to get.spectrum() is a subset where Subject and Trial are fixed, leaving Channel and Sample to vary. However, it is quite slow, and spends over 80% of the computational load in 1 of the 4 cores I have on this machine.
data.table approach
I went back to some toy cases that came up in the discussion, and tried this again:
spec.dt <- big.table[,get.spectrum(Voltage),by=c("Subject","Trial")]
This is close! It returns a data.table of almost the right structure.
> str(spec.dt)
Classes ‘data.table’ and 'data.frame': 140000 obs. of 3 variables:
$ Subject: int 1 1 1 1 1 1 1 1 1 1 ...
$ Trial : int 1 1 1 1 1 1 1 1 1 1 ...
$ V1 : num 110.7 109 105.4 101.6 98.2 ...
However, the Channel variable is missing. Easily fixed:
> spec.dt <- erp.table[,get.spectrum(Voltage),by=c("Subject","Trial","Channel")]
> str(spec.dt)
Classes ‘data.table’ and 'data.frame': 16520000 obs. of 4 variables:
$ Subject: int 1 1 1 1 1 1 1 1 1 1 ...
$ Trial : int 1 1 1 1 1 1 1 1 1 1 ...
$ Channel: int 1 1 1 1 1 1 1 1 1 1 ...
$ V1 : num 78.6 78.6 78.6 78.5 78.5 ...
- attr(*, ".internal.selfref")=<externalptr>
Is this right? Well, it's easy to check if it's the right shape. We know that there are 500 frequency bins in the default spectrum() call, and I stated that the data had 118 channels.
> nrow(spec.dt)
[1] 16520000
> nrow(spec.dt)/500
[1] 33040
> nrow(spec.dt)/500/118
[1] 280
I didn't mention it in the original question, but there are indeed 280 trials.
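The same grouped call can also carry the frequency axis along with the spectrum, because a named list returned from j becomes one column per element. Below is a minimal sketch on made-up data, not the real recordings: toy and get.spectrum2 are hypothetical names, the table sizes are arbitrary, and as.vector() is only there in case spec comes back as a one-column matrix.

```r
library(data.table)

# Hypothetical toy table: 2 subjects x 3 trials x 4 channels x 350 samples
set.seed(1)
toy <- CJ(Subject = 1:2, Trial = 1:3, Channel = 1:4, Sample = 1:350)
toy[, Voltage := rnorm(.N, -150, 100)]

# Variant of get.spectrum() that returns a named list (hypothetical name)
get.spectrum2 <- function(x) {
  spec.obj <- spectrum(x, method = "ar", plot = FALSE)
  list(Frequency = as.vector(spec.obj$freq),
       Spectrum  = as.vector(20 * log10(spec.obj$spec)))
}

# One AR spectrum per (Subject, Trial, Channel) series; each list element
# becomes a column in the result
spec.toy <- toy[, get.spectrum2(Voltage), by = .(Subject, Trial, Channel)]
str(spec.toy)
```

With the default 500 frequency bins, every group contributes 500 rows, so the row count can be checked the same way as above.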
remark
An apparent rule here is that in the by argument, you need to leave out the independent variable corresponding to the dependent data. If you don't, the other error appears.
> spectra.table <- big.table[,get.spectrum(Voltage),by=c("Sample","Subject","Channel")]
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action, :
'order.max' must be >= 1
Here Voltage is a function of Sample (since sample is an index)--it is repeated over and over again for each Channel and each Subject.
I don't know exactly what the problem is here, though.
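One plausible explanation, offered as a sketch rather than a diagnosis of the real data: whenever the by columns pin a group down to a single observation, the autoregressive fit behind spectrum(method="ar") has no usable model order for a series of length one. The toy table below is hypothetical (toy, sizes, and msg are made-up names); it only shows that one-row groups reproduce the same message.

```r
library(data.table)

# Hypothetical toy table: 2 subjects x 2 trials x 3 channels x 350 samples
set.seed(42)
toy <- CJ(Subject = 1:2, Trial = 1:2, Channel = 1:3, Sample = 1:350)
toy[, Voltage := rnorm(.N, -150, 100)]

# Grouping by every index column leaves exactly one row per group
sizes <- toy[, .N, by = .(Subject, Trial, Channel, Sample)]
max.group.size <- max(sizes$N)

# A series of length one is rejected by the AR fit; capture the message
msg <- tryCatch(
  {
    spectrum(toy$Voltage[1], method = "ar", plot = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping Sample from by hands each call a full 350-point series
sizes.ok <- toy[, .N, by = .(Subject, Trial, Channel)]
```

Checking group sizes with .N before calling the expensive function is a cheap way to see what each call will actually receive.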
benchmarks
> system.time(spec.dt <- erp.table[,get.spectrum(Voltage),by=c("Subject","Trial","Channel")])
user system elapsed
86.669 3.452 87.414
system.time(
  spectra <- foreach(s.ind = unique(erp.table$Subject), .combine=rbind) %:%
    foreach(t.ind = unique(erp.table$Trial), .combine=rbind) %dopar% {
      cbind((sampling.rate * (seq_len(freq.bins)-1) / sampling.rate),
            rep(c.ind,freq.bins),
            rep(t.ind,freq.bins),
            get.spectrum((subset(erp.table,
                                 subset=(Subject==s.ind &
                                         Trial==t.ind),
                                 select=Voltage))$Voltage),
            rep(s.ind,freq.bins))
    })
user system elapsed
114.259 17.937 131.873
The second benchmark is optimistic; I had run it a second time without cleaning up the environment or removing variables.
Source: https://stackoverflow.com/questions/21156801/applying-non-trivial-functions-to-ordered-subsets-of-data-table