sparse-matrix

clustering on very large sparse matrix?

戏子无情 submitted on 2019-12-01 00:11:56
I am trying to run (k-means) clustering on a very large matrix. The matrix is approximately 500,000 rows x 4,000 columns yet very sparse (only a couple of "1" values per row). I want to get around 2,000 clusters. I have two questions:
- Can someone recommend an open source platform or tool for doing that (maybe using k-means, maybe something better)?
- How can I best estimate the time the algorithm will need to finish?
I tried Weka once, but aborted the job after a couple of days because I couldn't tell how much time it would take. Thanks!

http://lucene.apache.org/mahout/ For your case, I

Multithreaded sparse matrix multiplication in Matlab

左心房为你撑大大i submitted on 2019-11-30 23:53:01
Question: I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). Both N and M are large, on the order of several thousand. I am running Matlab 2013a. Usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically. This appears NOT to be the case if either of the matrices is sparse (see e.g. this StackOverflow

Efficient slicing of matrices using matrix multiplication, with Python, NumPy, SciPy

守給你的承諾、 submitted on 2019-11-30 23:52:57
I want to reshape a 2D scipy.sparse.csr.csr_matrix (let us call it A) into a 2D numpy.ndarray (let us call it B). A could be >shape(A) (90, 10) then B should be >shape(B) (9,10) where each block of 10 rows of A is reduced to a single new value per column, namely the maximum over that window. The column operator is not working on this unhashable type of a sparse matrix. How can I get this B by using matrix multiplications?

Using matrix multiplication you can do efficient slicing by creating a "slicer" matrix with ones in the right places. The sliced matrix will have the same type as the "slicer", so
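The "slicer" idea in the excerpt above can be sketched roughly as follows. This is a minimal illustration, assuming SciPy's `csr_matrix` and a hypothetical helper name `make_row_slicer` (not from the original answer). It shows only row selection by multiplication, not the windowed maximum the question asks for.

```python
import numpy as np
from scipy import sparse


def make_row_slicer(row_indices, n_rows):
    """Build a (k x n_rows) sparse matrix with a single 1 in each row.

    Left-multiplying an (n_rows x m) matrix by it picks out the requested rows.
    Hypothetical helper for illustration only.
    """
    row_indices = np.asarray(row_indices)
    k = len(row_indices)
    ones = np.ones(k, dtype=np.int64)
    return sparse.csr_matrix((ones, (np.arange(k), row_indices)), shape=(k, n_rows))


# Small example matrix: 4 rows, 3 columns.
A = sparse.csr_matrix(np.arange(12).reshape(4, 3))

# Select rows 0 and 2 through a sparse-sparse product.
slicer = make_row_slicer([0, 2], A.shape[0])
picked = slicer.dot(A)  # the result stays sparse
```

Because both operands are sparse, the product never materializes a dense copy of A.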

Calculate the euclidean distance in scipy csr matrix

谁说胖子不能爱 submitted on 2019-11-30 23:27:50
I need to calculate the Euclidean distance between all points that are stored in a CSR sparse matrix and some lists of points. It would be easier for me to convert the CSR matrix to a dense one, but I couldn't due to lack of memory, so I need to keep it as CSR. So for example I have this data_csr sparse matrix (shown in both CSR and dense form):

data_csr
(0, 2) 4
(1, 0) 1
(1, 4) 2
(2, 0) 2
(2, 3) 1
(3, 5) 1
(4, 0) 4
(4, 2) 3
(4, 3) 2

data_csr.todense()
[[0, 0, 4, 0, 0, 0]
 [1, 0, 0, 0, 2, 0]
 [2, 0, 0, 1, 0, 0]
 [0, 0, 0, 0, 0, 1]
 [4, 0, 3, 2, 0, 0]]

and this list of center points:

center
array([[0, 1, 2,
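One way to get these distances without densifying the CSR matrix is the expansion ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, where only the small centers array is dense. The sketch below assumes that approach. The function name and the `example_centers` values are hypothetical, since the original `center` array is cut off in the excerpt.

```python
import numpy as np
from scipy import sparse


def sparse_to_centers_distance(X, centers):
    """Euclidean distance from each row of sparse X to each dense center row."""
    centers = np.asarray(centers, dtype=float)
    # Squared row norms of X, computed on the stored entries only.
    x_sq = np.asarray(X.multiply(X).sum(axis=1), dtype=float).ravel()
    # Squared row norms of the dense centers.
    c_sq = (centers ** 2).sum(axis=1)
    # Sparse-times-dense product gives a dense (n_points x n_centers) array.
    cross = np.asarray(X.dot(centers.T))
    d2 = x_sq[:, None] - 2.0 * cross + c_sq[None, :]
    # Clip tiny negative values caused by floating-point rounding.
    np.maximum(d2, 0.0, out=d2)
    return np.sqrt(d2)


data_csr = sparse.csr_matrix(np.array([
    [0, 0, 4, 0, 0, 0],
    [1, 0, 0, 0, 2, 0],
    [2, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [4, 0, 3, 2, 0, 0],
]))

# Hypothetical centers, chosen only for illustration.
example_centers = np.array([[0, 0, 4, 0, 0, 0],
                            [1, 1, 1, 1, 1, 1]])

distances = sparse_to_centers_distance(data_csr, example_centers)
```

The result has one row per point and one column per center, so memory use scales with the number of centers rather than with the number of columns of the sparse matrix.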

Weka printing sparse arff file

烈酒焚心 submitted on 2019-11-30 23:14:02
I was trying out the sparse representation of the ARFF file as shown here. In my program I am able to print the class label "B", but for some reason it is not printing "A".

attVals = new FastVector();
attVals.addElement("A");
attVals.addElement("B");
atts.addElement(new Attribute("class", attVals));
vals[index] = attVals.indexOf("A");

The output of the program is {0 6,2 8}, but I should get {0 6,2 8,3 A}. When I instead do vals[index] = attVals.indexOf("B"); I get the proper output {0 6,2 8,3 B}. For some reason it is not taking index 0. Can someone tell me why this is happening?

This

How to add two Sparse Vectors in Spark using Python

淺唱寂寞╮ submitted on 2019-11-30 22:24:17
I've searched everywhere but I couldn't find how to add two sparse vectors using Python. I want to add two sparse vectors like this:

(1048576, {110522: 0.6931, 521365: 1.0986, 697409: 1.0986, 725041: 0.6931, 749730: 0.6931, 962395: 0.6931})
(1048576, {4471: 1.0986, 725041: 0.6931, 850325: 1.0986, 962395: 0.6931})

Something like this should work:

from pyspark.mllib.linalg import Vectors, SparseVector, DenseVector
import numpy as np

def add(v1, v2):
    """Add two sparse vectors
    >>> v1 = Vectors.sparse(3, {0: 1.0, 2: 1.0})
    >>> v2 = Vectors.sparse(3, {1: 1.0})
    >>> add(v1, v2)
    SparseVector(3, {0: 1.0
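Since the answer's code is cut off above, here is a dependency-free sketch of the same idea. It is not the PySpark implementation from the answer: it assumes each vector is represented as the plain `(size, {index: value})` pair shown in the question, and `add_sparse` is a hypothetical name.

```python
def add_sparse(v1, v2):
    """Add two sparse vectors given as (size, {index: value}) pairs."""
    size1, values1 = v1
    size2, values2 = v2
    if size1 != size2:
        raise ValueError("vectors must have the same size")
    # Start from a copy of the first vector's entries, then merge the second.
    summed = dict(values1)
    for index, value in values2.items():
        summed[index] = summed.get(index, 0.0) + value
    return size1, summed


v1 = (1048576, {110522: 0.6931, 521365: 1.0986, 697409: 1.0986,
                725041: 0.6931, 749730: 0.6931, 962395: 0.6931})
v2 = (1048576, {4471: 1.0986, 725041: 0.6931, 850325: 1.0986, 962395: 0.6931})

total = add_sparse(v1, v2)
```

In a PySpark setting, the resulting size and dictionary could presumably be passed to `Vectors.sparse(size, values)`, the constructor used in the docstring above.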

Element-wise power of scipy.sparse matrix

我是研究僧i submitted on 2019-11-30 22:06:44
Question: How do I raise a scipy.sparse matrix to a power, element-wise? numpy.power should, according to its manual, do this, but it fails on sparse matrices:

>>> X
<1353x32100 sparse matrix of type '<type 'numpy.float64'>' with 144875 stored elements in Compressed Sparse Row format>
>>> np.power(X, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../scipy/sparse/base.py", line 347, in __pow__
    raise TypeError('matrix is not square')
TypeError: matrix is not square

Same
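The excerpt ends before any answer, so the following is only a sketch of two routes that should work in reasonably recent SciPy versions: the `power` method of sparse matrices, and applying the exponent to the stored values in `.data`. Both assume a positive exponent, so that implicit zeros stay zero.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the large matrix in the question.
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [3.0, 0.0, 4.0]]))

# Route 1: element-wise power via the sparse matrix method.
squared = X.power(2)

# Route 2: operate on the stored values of a copy; zeros are untouched.
squared_alt = X.copy()
squared_alt.data **= 2
```

Both results remain sparse and have the same sparsity pattern as X.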

scipy.sparse dot extremely slow in Python

这一生的挚爱 submitted on 2019-11-30 21:32:50
The following code will not even finish on my system:

import numpy as np
from scipy import sparse
p = 100
n = 50
X = np.random.randn(p,n)
L = sparse.eye(p,p, format='csc')
X.T.dot(L).dot(X)

Is there any explanation for why this matrix multiplication hangs?

X.T.dot(L) is not, as you may think, a 50x100 matrix, but a 50x100 array of 100x100 sparse matrices:

>>> X.T.dot(L).shape
(50, 100)
>>> X.T.dot(L)[0,0]
<100x100 sparse matrix of type '<type 'numpy.float64'>' with 100 stored elements in Compressed Sparse Column format>

It seems that the problem is that X's dot method, it being an array,
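A possible workaround, sketched under the assumption that the goal is the ordinary product X^T L X, is to let the sparse matrix's own `dot` method handle the sparse-dense step first. The excerpt does not show the answer's final recommendation, so treat this as one option rather than the original fix.

```python
import numpy as np
from scipy import sparse

p, n = 100, 50
X = np.random.randn(p, n)
L = sparse.eye(p, p, format='csc')

# Let the sparse matrix drive the first product: sparse.dot(dense) -> dense (p x n).
LX = L.dot(X)

# The remaining product is dense x dense, giving an (n x n) array.
result = X.T.dot(LX)
```

Here the dense array's `dot` method never sees the sparse matrix, so NumPy does not treat it as an opaque object.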

scipy.sparse default value

我们两清 submitted on 2019-11-30 19:35:57
The sparse matrix format (dok) assumes that values of keys not in the dictionary are equal to zero. Is there any way to make it use a default value other than zero? Also, is there a way to calculate the log of a sparse matrix (akin to np.log on a regular NumPy matrix)?

That feature is not built in, but if you really need it, you should be able to write your own dok_matrix class, or subclass SciPy's. The SciPy implementation is here: https://github.com/scipy/scipy/blob/master/scipy/sparse/dok.py At least in the places where dict.* calls are made, the default value needs to be changed --- and
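For the logarithm part of the question, one hedged option is to apply `np.log` only to the stored values after converting to CSR. Note the assumption: unstored entries stay at 0 instead of becoming -inf, so this is not the true element-wise log of the full matrix. The `log1p` method, which maps 0 to 0, is shown as an alternative.

```python
import numpy as np
from scipy import sparse

M = sparse.dok_matrix((3, 3), dtype=float)
M[0, 1] = np.e
M[2, 0] = np.e ** 2

# Convert to CSR (a new matrix) and take the log of the stored values only.
logged = M.tocsr()
logged.data = np.log(logged.data)

# log(1 + x) keeps implicit zeros at exactly zero, so it is well defined
# for every entry of a sparse matrix.
logged1p = M.tocsr().log1p()
```

Whether leaving the unstored entries at 0 is acceptable depends on what the zeros mean in the application.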

Randomly shuffle a sparse matrix in python

狂风中的少年 submitted on 2019-11-30 19:16:42
Is there an easy way to shuffle a sparse matrix in Python? This is how I shuffle a non-sparse matrix:

index = np.arange(np.shape(matrix)[0])
np.random.shuffle(index)
return matrix[index]

How can I do it with a SciPy sparse matrix?

OK, found it. The sparse format looks a bit confusing in the print-out.

index = np.arange(np.shape(matrix)[0])
print(index)
np.random.shuffle(index)
return matrix[index, :]

shaneb: In case anyone is looking to randomly get a subsample of rows from a sparse matrix, this related post may also be useful: How should I go about subsampling from a scipy.sparse.csr.csr_matrix and a
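The row-shuffling approach above can be wrapped into a small self-contained function. This is a sketch assuming a CSR matrix (which supports row fancy-indexing) and NumPy's `default_rng`. The name `shuffle_rows` and the seed are illustrative choices, not part of the original post.

```python
import numpy as np
from scipy import sparse


def shuffle_rows(matrix, rng):
    """Return a copy of a CSR matrix with its rows in random order."""
    index = rng.permutation(matrix.shape[0])
    return matrix[index, :]


rng = np.random.default_rng(0)  # fixed seed only to make the example repeatable
matrix = sparse.csr_matrix(np.arange(20).reshape(5, 4))
shuffled = shuffle_rows(matrix, rng)
```

Taking only the first k entries of the permutation would give a random subsample of k rows instead of a full shuffle.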