How to order a sparse matrix and store the result

Question

I have a big sparse matrix:

> str(qtr_sim)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:32395981] 0 1 2 3 4 5 6 7 8 1 ...
  ..@ p       : int [1:28182] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ Dim     : int [1:2] 28181 28181
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:28181] "1000191" "1000404" "1000457" "1000541" ...
  .. ..$ : chr [1:28181] "1000191" "1000404" "1000457" "1000541" ...
  ..@ x       : num [1:32395981] 1 1 1 1 1 ...
  ..@ uplo    : chr "U"
  ..@ factors : list()

The matrix contains cosine similarity values, i.e. numbers between 0 and 1.

Here is a small example of such a matrix; I will call A, ..., E "products":

>A
5 x 5 sparse Matrix of class "dgCMatrix"
     A    B    C   D    E
A 1.00 0.51 .    .   0.03
B 0.51 1.00 0.40 .   0.06
C .    0.40 1.00 0.1 0.05
D .    .    0.10 1.0 .   
E 0.03 0.06 0.05 .   1.00


> dput(A)
new("dgCMatrix"
    , i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L, 0L, 1L, 
2L, 4L)
    , p = c(0L, 3L, 7L, 11L, 13L, 17L)
    , Dim = c(5L, 5L)
    , Dimnames = list(c("A", "B", "C", "D", "E"), c("A", "B", "C", "D", "E"))
    , x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05, 0.1, 
1, 0.03, 0.06, 0.05, 1)
    , factors = list()
)

I need a fast way to derive two matrices, B and C, from A. B holds each column's values sorted in decreasing order, and C holds the corresponding product names:

>B
5 x 5 sparse Matrix of class "dgCMatrix"
            A       B       C       D       E     
  [1,]   1.00    1.00    1.00     1.0    1.00
  [2,]   0.51    0.51    0.40     0.1    0.06      
  [3,]   0.03    0.40    0.10       .    0.05
  [4,]      .    0.06    0.05       .    0.03   
  [5,]      .       .       .       .       .

>C
            A       B       C       D       E     
  [1,]      A       B       C       D       E
  [2,]      B       A       B       C       B      
  [3,]      E       C       D      NA       C
  [4,]     NA       E       E      NA       A   
  [5,]     NA      NA      NA      NA      NA

The empty cells don't have to be "NA", but that is what my code uses (see below).

I'm using this approach:

  B <- C <- matrix(NA, nrow = nrow(A), ncol = ncol(A))
  colnames(C) <- colnames(B) <- colnames(A)

  for (j in seq_len(ncol(A))) {
    posi <- colnames(A)[j]
    col <- A[, j]   # column j as a named numeric vector

    d <- order(col, decreasing = TRUE)
    mat <- col[d]

    self <- which(names(mat) == posi)
    if (self != 1) {
      # Move the product itself to the first position
      mat <- c(mat[self], mat[-self])
    } # This is needed because a similarity of 1 sometimes belongs to
      # more than one product, and the first row must equal the column
      # name. Any other product with similarity 1 should follow in the
      # next row, and so on.

    names(mat)[mat == 0] <- NA   # zero similarity means "no entry"
    C[, j] <- names(mat)
    B[, j] <- as.numeric(mat)
  }

But this approach is really slow. Note that the original sparse matrix is much bigger than this example A.

How can I improve my approach?

Answer 1:

OK, maybe this is of use:

library(data.table)
DT <- data.table(val = A@x, i = A@i + 1L, 
                 product = rownames(A)[A@i + 1L],
                 j = rep(rownames(A), diff(A@p)))
setorderv(DT, c("j", "val"), c(1L, -1L))
DT[, newi := seq_len(.N), by = j]

dcast(DT, newi ~ j, value.var = "val")
#   newi    A    B    C   D    E
#1:    1 1.00 1.00 1.00 1.0 1.00
#2:    2 0.51 0.51 0.40 0.1 0.06
#3:    3 0.03 0.40 0.10  NA 0.05
#4:    4   NA 0.06 0.05  NA 0.03
dcast(DT, newi ~ j, value.var = "product")
#   newi  A B C  D E
#1:    1  A B C  D E
#2:    2  B A B  C B
#3:    3  E C D NA C
#4:    4 NA E E NA A
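
One caveat when applying this to the real data: the question's str() output shows qtr_sim as a symmetric "dsCMatrix" with uplo = "U", which stores only the upper triangle, so its @i and @p slots cover only part of each column. The following is a minimal sketch, not a tested recipe for the full matrix. It assumes a recent version of the Matrix package (for the "generalMatrix" coercion) and uses a toy matrix S as a hypothetical stand-in for qtr_sim:

```r
library(Matrix)

# Toy symmetric matrix built from the upper triangle of the question's
# example A; S is a hypothetical stand-in for the real qtr_sim.
nm <- c("A", "B", "C", "D", "E")
S <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3, 4, 1, 2, 3, 5),
  j = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5),
  x = c(1, 0.51, 1, 0.4, 1, 0.1, 1, 0.03, 0.06, 0.05, 1),
  dims = c(5, 5), dimnames = list(nm, nm),
  symmetric = TRUE            # gives a dsCMatrix, upper triangle only
)

# Expand to general storage so that every column lists all of its entries.
G <- as(S, "generalMatrix")

# Compressed-column slots to (row, column, value) triplets:
# @i holds 0-based row indices, diff(@p) the number of entries per column.
trip <- data.frame(
  i = G@i + 1L,
  j = rep(seq_len(ncol(G)), diff(G@p)),
  x = G@x
)
```

Here S@x holds only the stored triangle, while trip has one row per nonzero cell of the full matrix. The expansion roughly doubles the number of stored entries, which matters for a matrix with tens of millions of nonzeros.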

Of course, the reshape could potentially create large dense objects, thereby exhausting your memory. If that is a problem, you'll need to reverse the first step and try creating a sparse matrix using newi, j and val.
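
A sketch of that sparse variant, under some assumptions: A is already in general (non-symmetric) storage, column positions are kept as integers rather than names, and the name C_idx and the extra self column are introduced here for illustration only. The self column is one possible way to force each product into the first row of its own column when other products also have similarity 1, as the question requires:

```r
library(Matrix)
library(data.table)

# Example matrix from the question (same slots as its dput() output)
nm <- c("A", "B", "C", "D", "E")
A <- new("dgCMatrix",
         i = c(0L, 1L, 4L, 0L, 1L, 2L, 4L, 1L, 2L, 3L, 4L, 2L, 3L,
               0L, 1L, 2L, 4L),
         p = c(0L, 3L, 7L, 11L, 13L, 17L),
         Dim = c(5L, 5L),
         Dimnames = list(nm, nm),
         x = c(1, 0.51, 0.03, 0.51, 1, 0.4, 0.06, 0.4, 1, 0.1, 0.05,
               0.1, 1, 0.03, 0.06, 0.05, 1))

# One row per stored entry: value, row index, column index
DT <- data.table(val = A@x,
                 i = A@i + 1L,
                 j = rep(seq_len(ncol(A)), diff(A@p)))
DT[, self := as.integer(i == j)]   # 1 on the diagonal, 0 elsewhere

# Within each column: the diagonal entry first, then decreasing similarity
setorderv(DT, c("j", "self", "val"), c(1L, -1L, -1L))
DT[, newi := seq_len(.N), by = j]

# Sorted values, stored sparsely
B <- sparseMatrix(i = DT$newi, j = DT$j, x = DT$val,
                  dims = dim(A), dimnames = list(NULL, colnames(A)))

# Original row index of each sorted value (a 0 cell means "no entry")
C_idx <- sparseMatrix(i = DT$newi, j = DT$j, x = as.numeric(DT$i),
                      dims = dim(A), dimnames = list(NULL, colnames(A)))
```

Product names can then be looked up on demand, for example rownames(A)[C_idx[1:3, "A"]], without ever materialising a dense character matrix. Storing indices instead of names is a design choice made here because sparse matrices in the Matrix package hold numeric or logical values, not strings.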

Source: https://stackoverflow.com/questions/46116609/how-to-order-sparse-matrix-and-store-the-result

Tags

sparse-matrix