Using a sparse matrix versus numpy array

Happy的楠姐 2021-02-01 03:23

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices instead.

3 Answers
  •  庸人自扰
    2021-02-01 04:04

    "A sparse matrix is a matrix in which most of the elements are zero." Is that an appropriate way to decide when to use a sparse matrix format — as soon as more than 50% of the values are zero? Or does it make sense to use it just in case?

    There is no general rule. It depends entirely on your exact usage later on. You have to compute the complexity of the model with and without the sparse matrix, and then you can find the "sweet spot". This will depend on both the number of samples and the dimension. In general, it often boils down to matrix multiplications of the form


    X' W
    

    where X is an N x d data matrix and W is some d x K weight matrix. Consequently, "dense" multiplication takes NdK time, while sparse multiplication, assuming that your average per-row density is p, takes NpdK. Thus if your sparsity is 50% (p = 0.5) you can expect a nearly 2x faster operation. The harder part is estimating the overhead of sparse access compared to heavily optimized dense operations.
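    A minimal sketch of the X' W product described above, checking that the sparse and dense paths agree (the sizes and the ~90% sparsity level are arbitrary choices for illustration):

    ```python
    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    N, d, K = 200, 50, 10

    # Dense data matrix with roughly 90% zeros (p ≈ 0.1)
    X = rng.random((N, d))
    X[X < 0.9] = 0.0
    W = rng.random((d, K))

    X_sp = sparse.csr_matrix(X)

    dense_out = X @ W          # NdK multiply-adds
    sparse_out = X_sp @ W      # only touches the nonzero entries: ~NpdK

    assert np.allclose(dense_out, sparse_out)
    ```

    For timing rather than correctness, `%timeit X @ W` versus `%timeit X_sp @ W` in IPython will show where the crossover lies on your hardware.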

    How much does a sparse matrix help performance in a task like mine, especially compared to a numpy array or a standard list?

    For the particular case of logistic regression (LR), this can be even a few times faster than the dense format, but in order to observe the difference you need lots of data (>1000 samples) of high dimension (>100 features).
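    As a sketch of that point: scikit-learn's LogisticRegression accepts SciPy CSR input directly, so no densification is needed during training (the sizes, density, and random labels below are made-up illustration values):

    ```python
    import numpy as np
    from scipy import sparse
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    # 2000 samples x 300 features, 5% nonzero — well past the ">1000, >100" threshold
    X = sparse.random(2000, 300, density=0.05, format="csr", random_state=42)
    y = rng.integers(0, 2, size=2000)  # arbitrary binary labels for the sketch

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)  # trains on the sparse matrix as-is
    ```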

    So far, I collect my data into a numpy array and then convert it into a csr_matrix in SciPy. Is that the right way to do it? I could not figure out how to build a sparse matrix from the ground up, and that might be impossible.

    No, it is not a good approach. You can build it "from scratch", for example by first building a dictionary and then converting it. There are plenty of ways to construct a sparse matrix without materializing a dense one first.
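    One way to do this for the word-count use case: collect (row, col, count) triplets per document and hand them straight to the csr_matrix constructor, never allocating a dense array (the tiny two-document corpus is an invented example):

    ```python
    from collections import Counter
    from scipy import sparse

    docs = [["apple", "banana", "apple"], ["banana", "cherry"]]

    vocab = {}                      # word -> column index, assigned on first sight
    rows, cols, data = [], [], []
    for i, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            j = vocab.setdefault(word, len(vocab))
            rows.append(i)
            cols.append(j)
            data.append(count)

    # Build CSR directly from the COO-style triplets
    X = sparse.csr_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))
    # X.toarray() → [[2, 1, 0], [0, 1, 1]]
    ```

    scipy.sparse also offers `dok_matrix` and `lil_matrix` for incremental construction, with a cheap conversion to CSR afterwards for arithmetic.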
