Using a sparse matrix versus numpy array

后端 未结 3 1844
Happy的楠姐
Happy的楠姐 2021-02-01 03:23

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matric

3条回答
  •  刺人心
    刺人心 (楼主)
    2021-02-01 04:12

    The scipy sparse matrix package, and similar ones in MATLAB, was based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like matrix product (the dot product for numpy arrays) and equation solvers are well developed.

    My rough experience is that a sparse csr matrix product has to have a 1% sparsity to be faster than the equivalent dense dot operation - in other words, one nonzero value for every 99 zeros. (but see tests below)

    But people also try to use sparse matrices to save memory. But keep in mind that such a matrix has to store 3 arrays of values (at least in the coo format). So the sparsity has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array, and create the sparse one from that.

    The scipy package implements many sparse formats. The coo format is easiest to understand and build. Build one according to documentation and look at its .data, .row, and .col attributes (3 1d arrays).

    csr and csc are typically built from the coo format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.

    It is also possible to index csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations like changing values (especially from 0 to nonzero), concatenation, incremental growth, are also slower.

    lil (lists of lists) is also easy to understand, and best for incremental building. dok is a actually a dictionary subclass.

    A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).

    A search for other questions using scikit-learn and sparse might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense is that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.


    I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.

    In [251]: M=sparse.random(1000,1000,.5)
    
    In [252]: timeit M1=M*M
    1 loops, best of 3: 2.78 s per loop
    
    In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
    1 loops, best of 3: 4.28 s per loop
    

    It is a size issue; for smaller matrix the dense dot is faster

    In [255]: M=sparse.random(100,100,.5)
    
    In [256]: timeit M1=M*M
    100 loops, best of 3: 3.24 ms per loop
    
    In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
    1000 loops, best of 3: 1.44 ms per loop
    

    But compare indexing

    In [268]: timeit M.tocsr()[500,500]
    10 loops, best of 3: 86.4 ms per loop
    
    In [269]: timeit Ma[500,500]
    1000000 loops, best of 3: 318 ns per loop
    
    In [270]: timeit Ma=M.toarray();Ma[500,500]
    10 loops, best of 3: 23.6 ms per loop
    

提交回复
热议问题