问题
I have gene expression data as number of counts for each probe, something like this:
library(data.table)
mydata <- fread(
"molclass,mol.id,sample1,sample2,sample3
negative, negat1, 0, 1, 2
negative, negat2, 2, 1, 1
negative, negat3, 1, 2, 0
endogen, gene1, 30, 15, 10
endogen, gene2, 60, 30, 20
")
My question here is - what would be the best way to perform background subtraction, i.e. for each sampleN
column I need to calculate background (let's say it will be the average of all values from negative
class) and then subtract this background from each value of this column. For the moment I am using the following solution:
for (nm in names(mydata)[-c(1:2)]) {
bg <- mydata[molclass=='negative', nm, with=F];
bg <- mean(unlist(bg));
mydata[[nm]] <- (mydata[[nm]] - bg);
}
but I feel there must be some "nicer" way.
P.S. I know that there are some packages that do those things, but my data correspond to the number of counts, not intensity of signal - so I can't use limma
or similar tools designed for microarrays. Maybe some seq-data packages could help, but I am not sure because my data is not from sequencing either.
回答1:
Generally, you shouldn't use <-
with a data.table
. The last assignment in your loop would be better with set
. See the help page by typing ?set
for details.
mycols <- paste0('sample',1:3)
newcols <- paste0(mycols,'bk')
s <- mydata[['molclass']] == 'negative'
mybkds <- sapply(mycols,function(j) mean(mydata[[j]][s]) )
mydata[,(newcols):=NA]
for (j in mycols) set(mydata,j=paste0(j,'bk'),value=mydata[[j]]-mybkds[j])
I've only done the last step in a loop, but this is basically the same as your code (where everything is in the loop). *apply
functions and loops are just different syntax, I've heard, and you can go with whichever you prefer.
回答2:
If you need to replace the sample
columns with the calculated values, you can use set
(as in @Frank's post) but without creating an additional object
indx <- grep('^sample', names(mydata))
for(j in indx){
set(mydata, i=NULL, j=j, value=mydata[[j]]-
mydata[molclass=='negative', mean(unlist(.SD)), .SDcols=j])
}
mydata
# molclass mol.id sample1 sample2 sample3
#1: negative negat1 -1 -0.3333333 1
#2: negative negat2 1 -0.3333333 0
#3: negative negat3 0 0.6666667 -1
#4: endogen gene1 29 13.6666667 9
#5: endogen gene2 59 28.6666667 19
Or a variant (more efficient) as suggested by @Frank would be
for(j in indx){
set(mydata, i=NULL, j=j, value=mydata[[j]]-
mean(mydata[[j]][mydata$molclass=='negative']))
}
来源:https://stackoverflow.com/questions/28597878/background-subtraction-in-a-table