Sparse vector RDD in pyspark
I have been implementing the TF-IDF method described here with Python/PySpark, using the feature module from MLlib: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html

I have a training set of 150 text documents and a test set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both the training and the test set, i.e. a bag-of-words representation, called tfidf_train and tfidf_test respectively. The IDF is shared between the two and is based solely on the training data. My question
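For context, a minimal sketch of the setup described above, following the pattern from the linked MLlib guide: the IDF model is fit on the training TF vectors only and then applied to both sets. The file paths, RDD names, and tokenization by whitespace are assumptions for illustration, not the asker's actual code.

```python
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-example")

# Hypothetical inputs: RDDs of token lists, one per document
train_docs = sc.textFile("train_docs/*.txt").map(lambda line: line.split(" "))
test_docs = sc.textFile("test_docs/*.txt").map(lambda line: line.split(" "))

# Hash each document into a term-frequency SparseVector
hashingTF = HashingTF()
tf_train = hashingTF.transform(train_docs)
tf_test = hashingTF.transform(test_docs)

# Fit the IDF on the training TF vectors only, then apply it to both sets,
# so the test set shares the training IDF as described above
tf_train.cache()
idf = IDF().fit(tf_train)
tfidf_train = idf.transform(tf_train)   # RDD of SparseVectors
tfidf_test = idf.transform(tf_test)     # RDD of SparseVectors
```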