background subtraction in a table

问题

I have gene expression data as number of counts for each probe, something like this:

library(data.table)
mydata <- fread(
"molclass,mol.id,sample1,sample2,sample3
negative, negat1,  0, 1,   2
negative, negat2,  2, 1,   1
negative, negat3,  1, 2,   0
 endogen,  gene1, 30, 15, 10
 endogen,  gene2, 60, 30, 20
")

My question here is - what would be the best way to perform background subtraction, i.e. for each sampleN column I need to calculate background (let's say it will be the average of all values from negative class) and then subtract this background from each value of this column. For the moment I am using the following solution:

for (nm in names(mydata)[-c(1:2)]) {
  bg <- mydata[molclass=='negative', nm, with=F];
  bg <- mean(unlist(bg));
  mydata[[nm]] <- (mydata[[nm]] - bg);
}

but I feel there must be some "nicer" way.

P.S. I know that there are some packages that do those things, but my data correspond to the number of counts, not intensity of signal - so I can't use limma or similar tools designed for microarrays. Maybe some seq-data packages could help, but I am not sure because my data is not from sequencing either.

回答1:

Generally, you shouldn't use <- with a data.table. The last assignment in your loop would be better with set. See the help page by typing ?set for details.

mycols  <- paste0('sample',1:3)
newcols <- paste0(mycols,'bk')

s       <- mydata[['molclass']] == 'negative'
mybkds  <- sapply(mycols,function(j) mean(mydata[[j]][s]) )

mydata[,(newcols):=NA]
for (j in mycols) set(mydata,j=paste0(j,'bk'),value=mydata[[j]]-mybkds[j])

I've only done the last step in a loop, but this is basically the same as your code (where everything is in the loop). *apply functions and loops are just different syntax, I've heard, and you can go with whichever you prefer.

回答2:

If you need to replace the sample columns with the calculated values, you can use set (as in @Frank's post) but without creating an additional object

indx <- grep('^sample', names(mydata))
for(j in indx){
 set(mydata, i=NULL, j=j, value=mydata[[j]]- 
       mydata[molclass=='negative', mean(unlist(.SD)), .SDcols=j])
}
mydata
#   molclass  mol.id sample1    sample2 sample3
#1: negative  negat1      -1 -0.3333333       1
#2: negative  negat2       1 -0.3333333       0
#3: negative  negat3       0  0.6666667      -1
#4:  endogen   gene1      29 13.6666667       9
#5:  endogen   gene2      59 28.6666667      19

Or a variant (more efficient) as suggested by @Frank would be

for(j in indx){
 set(mydata, i=NULL, j=j, value=mydata[[j]]- 
    mean(mydata[[j]][mydata$molclass=='negative']))
}

来源：https://stackoverflow.com/questions/28597878/background-subtraction-in-a-table

标签

data.table

bioconductor