Question
I'm the maintainer of the seqMeta package and looking for ideas on how to speed up the bottleneck of splitting a large matrix into smaller pieces a large number of times.
Background
The seqMeta package is used to analyze genetic data. So you have a group of subjects (n_subjects) and a number of genetic markers (n_snps). This leads to an n_subjects x n_snps matrix (Z). There is also a data frame that tells you which snps get grouped together for analysis (typically, which snps comprise a given gene).
While Z may be large, it is quite sparse: typically less than 10%, and sometimes only around 2%, of the values are non-zero. A sparse matrix representation seems like an obvious choice to save space.
A current project: n_subjects ~15,000 and n_snps ~2 million, with over 200,000 splits.
As the data size continues to grow, I've found the time-limiting factor tends to be the number of groupings, not the actual size of the data. (See the example below: runtime is a linear function of n_splits for the same data.)
Simplified Example
library(Matrix)
set.seed(1)
n_subjects <- 1e3
n_snps <- 1e5
sparsity <- 0.05
n <- floor(n_subjects * n_snps * sparsity)
# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals
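# aside (a sketch, not part of the original example): subassignment into a
# sparse Matrix as above can be slow; the same matrix can be built in a
# single sparseMatrix() call. R uses column-major linear indexing, hence:
row_idx <- ((pos - 1) %% n_subjects) + 1
col_idx <- ((pos - 1) %/% n_subjects) + 1
Z2 <- sparseMatrix(i = row_idx, j = col_idx, x = vals,
                   dims = c(n_subjects, n_snps))  # same contents as Z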
# create the data frame on how to split
# in the real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name = rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))
# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}
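# for context, a hypothetical stand-in for the real per-group computation
# described above (my sketch, not seqMeta's actual code): a dense
# cols x cols cross-product matrix, of which the upper triangle would be kept
cov_like_fun <- function(Z, cols) {
  Zi <- as.matrix(Z[, cols, drop = FALSE])  # densify the column subset
  crossprod(Zi)                             # t(Zi) %*% Zi, symmetric
}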
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 3.21 0.00 3.21
n_splits <- 1000
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name = rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 6.38 0.00 6.38
n_splits <- 5000
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name = rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))
# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
## user system elapsed
## 31.65 0.00 31.66
Question: Is there a faster way to subset a Matrix than '['? Or are there other approaches I'm missing?
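For concreteness, here is a sketch of the kind of alternative I mean (assumptions: Z is a dgCMatrix, as produced above, and fast_col_sum is just a made-up name; I have not verified that this actually beats '['). The idea is to read the column pointers in the @p slot directly and pull each group's non-zero values out of @x, bypassing the generic subsetting dispatch:
fast_col_sum <- function(Z, cols) {
  p <- Z@p  # column pointers: column j's non-zeros sit at (p[j]+1):p[j+1] in Z@x
  idx <- unlist(lapply(cols, function(j) {
    if (p[j] < p[j + 1]) (p[j] + 1):p[j + 1] else integer(0)
  }))
  sum(Z@x[idx])
}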
Source: https://stackoverflow.com/questions/39670743/is-there-a-faster-way-to-subset-a-sparse-matrix-than