R - data frame - convert to sparse matrix

Anonymous (unverified), submitted 2019-12-03 08:44:33

Question:

I have a data frame that is mostly zeros (a sparse data frame?), something similar to:

name,factor_1,factor_2,factor_3
ABC,1,0,0
DEF,0,1,0
GHI,0,0,1

The actual data is about 90,000 rows with 10,000 features. Can I convert this to a sparse matrix? I am expecting to gain time and space efficiencies by utilizing a sparse matrix instead of a data frame.

Any help would be appreciated

Update #1: Here is some code to generate the data frame. Thanks to Richard for providing it.

x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF", "GHI"),
                                     class = "factor"),
                    factor_1 = c(1L, 0L, 0L),
                    factor_2 = c(0L, 1L, 0L),
                    factor_3 = c(0L, 0L, 1L)),
               .Names = c("name", "factor_1", "factor_2", "factor_3"),
               class = "data.frame",
               row.names = c(NA, -3L))

Answer 1:

It might be a bit more memory efficient (but slower) to avoid copying all the data into a dense matrix:

library(Matrix)

# Convert each column to a one-column sparse Matrix, then bind them together
y <- Reduce(cbind2, lapply(x[,-1], Matrix, sparse = TRUE))
rownames(y) <- x[,1]

#3 x 3 sparse Matrix of class "dgCMatrix"
#
#ABC 1 . .
#DEF . 1 .
#GHI . . 1

If you have sufficient memory you should use Richard's answer, i.e., turn your data.frame into a dense matrix and then use Matrix.
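As a further illustration of avoiding a dense intermediate, here is a minimal sketch that collects the positions of the nonzero entries column by column and passes them to sparseMatrix() from the Matrix package. It is not taken from either answer; the variable names (cols, nz, vals) are made up for the example, and it assumes every column other than name is numeric. Unlike the cbind2 approach, it keeps the column names.

```r
library(Matrix)

# Same toy data as in the question
x <- data.frame(
  name     = c("ABC", "DEF", "GHI"),
  factor_1 = c(1L, 0L, 0L),
  factor_2 = c(0L, 1L, 0L),
  factor_3 = c(0L, 0L, 1L)
)

cols <- x[-1]  # assumption: all remaining columns are numeric

# Row indices of the nonzero entries in each column (NAs are dropped by which())
nz <- lapply(cols, function(v) which(v != 0))

i    <- unlist(nz, use.names = FALSE)       # row index of every nonzero
j    <- rep(seq_along(nz), lengths(nz))     # matching column index
vals <- unlist(Map(function(v, idx) v[idx], cols, nz), use.names = FALSE)

# Build the sparse matrix directly from (row, column, value) triplets
y <- sparseMatrix(
  i = i, j = j, x = as.numeric(vals),
  dims = c(nrow(cols), ncol(cols)),
  dimnames = list(as.character(x$name), names(cols))
)
```

Only one column of the data frame is inspected at a time, so peak memory should stay close to the size of the data frame plus the triplets.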



Answer 2:

I do this all the time and it's a pain in the butt, so I wrote a method for it called sparsify() in my R package - mltools. It operates on data.tables which are just fancy data.frames.


To solve your specific problem...

Install mltools (or just copy the sparsify() method into your environment)

Load packages

library(data.table)
library(Matrix)
library(mltools)

Sparsify

x <- data.table(x)  # convert x to a data.table
sparseM <- sparsify(x[, !"name"])  # sparsify everything except the name column
rownames(sparseM) <- x$name  # set the rownames

> sparseM
3 x 3 sparse Matrix of class "dgCMatrix"
    factor_1 factor_2 factor_3
ABC        1        .        .
DEF        .        1        .
GHI        .        .        1

In general, the sparsify() method is pretty flexible. Here are some examples of how you can use it:

Make some data. Notice data types and unused factor levels

dt <- data.table(
  intCol=c(1L, NA_integer_, 3L, 0L),
  realCol=c(NA, 2, NA, NA),
  logCol=c(TRUE, FALSE, TRUE, FALSE),
  ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
  ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
> dt
   intCol realCol logCol ofCol ufCol
1:      1      NA   TRUE     a     a
2:     NA       2  FALSE     b    NA
3:      3      NA   TRUE    NA     c
4:      0      NA  FALSE     b     b

Out-Of-The-Box Use

> sparsify(dt)
4 x 7 sparse Matrix of class "dgCMatrix"
     intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,]      1      NA      1     1       1       .       .
[2,]     NA       2      .     2      NA      NA      NA
[3,]      3      NA      1    NA       .       .       1
[4,]      .      NA      .     2       .       1       .

Convert NAs to 0s and Sparsify Them

> sparsify(dt, sparsifyNAs=TRUE)
4 x 7 sparse Matrix of class "dgCMatrix"
     intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,]      1       .      1     1       1       .       .
[2,]      .       2      .     2       .       .       .
[3,]      3       .      1     .       .       .       1
[4,]      .       .      .     2       .       1       .

Generate Columns That Identify NA Values

> sparsify(dt[, list(realCol)], naCols="identify")
4 x 2 sparse Matrix of class "dgCMatrix"
     realCol_NA realCol
[1,]          1      NA
[2,]          .       2
[3,]          1      NA
[4,]          1      NA

Generate Columns That Identify NA Values In the Most Memory Efficient Manner

> sparsify(dt[, list(realCol)], naCols="efficient")
4 x 2 sparse Matrix of class "dgCMatrix"
     realCol_NotNA realCol
[1,]             .      NA
[2,]             1       2
[3,]             .      NA
[4,]             .      NA


Answer 3:

You could make the first column into row names, then use Matrix from the Matrix package.

rownames(x) <- x$name
x <- x[-1]
library(Matrix)
Matrix(as.matrix(x), sparse = TRUE)
# 3 x 3 sparse Matrix of class "dtCMatrix"
#     factor_1 factor_2 factor_3
# ABC        1        .        .
# DEF        .        1        .
# GHI        .        .        1

where the original x data frame is

x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF",
"GHI"), class = "factor"), factor_1 = c(1L, 0L, 0L), factor_2 = c(0L,
1L, 0L), factor_3 = c(0L, 0L, 1L)), .Names = c("name", "factor_1",
"factor_2", "factor_3"), class = "data.frame", row.names = c(NA,
-3L))
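The dense route above overwrites x. If you want to keep the original data frame, the same steps can be wrapped in a small function. This is only a sketch: df_to_sparse and its name_col argument are hypothetical names chosen for this example, not part of the Matrix package, and it assumes every column other than the name column is numeric.

```r
library(Matrix)

# Hypothetical helper (not from any package): dense-then-sparse conversion
# that leaves the input data frame untouched.
df_to_sparse <- function(df, name_col = "name") {
  values <- df[setdiff(names(df), name_col)]
  # Guard: as.matrix() would silently produce a character matrix otherwise
  stopifnot(all(vapply(values, is.numeric, logical(1))))
  m <- as.matrix(values)                       # dense intermediate copy
  rownames(m) <- as.character(df[[name_col]])  # names become row names
  Matrix(m, sparse = TRUE)
}

x <- data.frame(
  name     = c("ABC", "DEF", "GHI"),
  factor_1 = c(1L, 0L, 0L),
  factor_2 = c(0L, 1L, 0L),
  factor_3 = c(0L, 0L, 1L)
)

s <- df_to_sparse(x)
```

Note that Matrix() chooses the result class based on the content, which is why the tiny example above is reported as "dtCMatrix" (triangular) rather than "dgCMatrix".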


Answer 4:

Just how sparse is your matrix? That determines how to reduce its size.

Your example matrix has 3 1s and 6 0s. With that ratio, there is little space saving from naively using Matrix.

> library('pryr') # for object_size
> library('Matrix')
> m <- matrix(rbinom(9e4*1e4, 1, 1/3), ncol = 1e4)
> object_size(m)
3.6 GB
> object_size(Matrix(m, sparse = T))
3.6 GB
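The reason for the tie is the storage layout. A dense integer matrix uses 4 bytes per cell, while a "dgCMatrix" stores a 4-byte row index plus an 8-byte double for each nonzero entry (plus one column pointer per column). At a density of about 1/3 the two come out roughly equal, and the sparse form only wins once the data is sparser than that. The sketch below checks this on a much smaller matrix with base R's object.size(); the dimensions, the densities, and the helper name size_at are arbitrary choices for illustration.

```r
library(Matrix)

set.seed(1)
n <- 2000  # rows (scaled down from the question's 90,000)
p <- 500   # columns (scaled down from 10,000)

# Hypothetical helper: size in bytes of a random 0/1 matrix at a given
# density, stored densely and as a sparse Matrix.
size_at <- function(density) {
  m <- matrix(rbinom(n * p, 1, density), ncol = p)
  c(dense  = as.numeric(object.size(m)),
    sparse = as.numeric(object.size(Matrix(m, sparse = TRUE))))
}

sizes_third <- size_at(1/3)   # same ratio as the toy example
sizes_1pct  <- size_at(0.01)  # a much sparser matrix

# Sparse size relative to dense size at each density
ratio_third <- sizes_third[["sparse"]] / sizes_third[["dense"]]
ratio_1pct  <- sizes_1pct[["sparse"]] / sizes_1pct[["dense"]]
```

With one-third of the cells nonzero the ratio should be close to 1, while at 1% density the sparse matrix should be a small fraction of the dense size.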

