sparse-matrix

Sparse vector RDD in pyspark

Submitted by 余生长醉 on 2019-12-12 12:26:31

Question: I have been implementing the TF-IDF method described here with Python/PySpark, using `feature` from MLlib: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html. I have a training set of 150 text documents and a testing set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both training and testing, i.e. bag-of-words representations, called tfidf_train and tfidf_test. The IDF is shared between both and is based solely on the training data. My question …
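The shared-IDF setup described above can be sketched without Spark at all; below is a minimal pure-Python illustration (hashed term frequencies, IDF fitted on the training set only, using the `log((n+1)/(df+1))` smoothing that MLlib documents). The helper names and the CRC32 hash are assumptions for the sketch, not the MLlib API.

```python
import math
import zlib

def hashed_tf(tokens, num_features=16):
    # Term frequencies via the hashing trick; CRC32 stands in for MLlib's hash
    vec = [0.0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode()) % num_features] += 1.0
    return vec

def fit_idf(tf_vectors):
    # IDF from the training vectors only: log((n + 1) / (df + 1))
    n = len(tf_vectors)
    width = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(width)]
    return [math.log((n + 1.0) / (d + 1.0)) for d in df]

def apply_idf(tf_vec, idf):
    return [t * w for t, w in zip(tf_vec, idf)]

train_docs = [["spark", "tfidf"], ["spark", "rdd"]]
test_docs = [["tfidf", "rdd"]]

tf_train = [hashed_tf(d) for d in train_docs]
idf = fit_idf(tf_train)                                  # shared IDF
tfidf_train = [apply_idf(v, idf) for v in tf_train]
tfidf_test = [apply_idf(hashed_tf(d), idf) for d in test_docs]
```

The key point matches the question's setup: the test documents are transformed with the IDF vector learned from the 150 training documents, never refitted.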

Convert Two column data frame to occurrence matrix in pandas

Submitted by 亡梦爱人 on 2019-12-12 12:16:12

Question: Hi all, I have a CSV file which contains data in the format below:

```
A a
A b
B f
B g
B e
B h
C d
C e
C f
```

The first column contains items; the second column contains the available feature from the feature vector [a,b,c,d,e,f,g,h]. I want to convert this to an occurrence matrix that looks like this:

```
  a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
```

Can anyone tell me how to do this using pandas?

Answer 1: Here is another way to do it using pd.get_dummies().

```python
import pandas as pd
# your data
# ===============
```
…
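The truncated answer's `pd.get_dummies` approach can be sketched end-to-end: one-hot encode the feature column, collapse the rows per item with `groupby(...).max()`, and `reindex` the columns so features that never occur (here `c`) appear as zero columns.

```python
import pandas as pd

# The two-column data from the question
df = pd.DataFrame({"item": list("AABBBBCCC"),
                   "feature": list("abfgehdef")})

# One-hot encode features, then collapse duplicate rows per item with max()
occ = pd.get_dummies(df["feature"], dtype=int).groupby(df["item"]).max()

# Reindex to the full feature vector so absent features become 0 columns
occ = occ.reindex(columns=list("abcdefgh"), fill_value=0)
print(occ)
```

The result reproduces the occurrence matrix from the question, including the all-zero `c` column.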

Finding eigenvectors and eigenvalues of a sparse matrix with ARPACK (called from Python, MATLAB, or as a FORTRAN subroutine)

Submitted by 主宰稳场 on 2019-12-12 10:17:26

Question: A few days ago I asked how to find the eigenvalues of a large sparse matrix. I got no answers, so I decided to describe a potential solution. One question remains: can I use the Python implementation of ARPACK to compute the eigenvalues of an asymmetric sparse matrix? To begin with, I would like to say that it is not at all necessary to call the subroutines of ARPACK directly using a FORTRAN driver program. That is quite difficult and I never got it going. But one can do the following:
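The route being described is presumably SciPy's ARPACK wrapper; a minimal sketch with `scipy.sparse.linalg.eigs`, which calls ARPACK's non-symmetric drivers and so accepts asymmetric matrices. The tridiagonal test matrix here is an illustrative stand-in, not the asker's matrix.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigs

n = 100
# Asymmetric tridiagonal test matrix: -1 below, 1..n on the diagonal, 2 above
A = diags([np.full(n - 1, -1.0),
           np.arange(1, n + 1, dtype=float),
           np.full(n - 1, 2.0)],
          offsets=[-1, 0, 1], format="csc")

# Six eigenvalues of largest magnitude, without ever forming A densely
vals, vecs = eigs(A, k=6, which="LM")
```

`eigs` returns complex arrays even when the spectrum happens to be real; `which="SM"` or shift-invert via `sigma=` targets the other end of the spectrum.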

How to build pysparse on Ubuntu

一个人想着一个人 提交于 2019-12-12 09:41:57
问题 When I try to install pysparse via pip install pysparse==1.3-dev , the build fails with the error: pysparse/sparse/src/spmatrixmodule.c:4:22: fatal error: spmatrix.h: No such file or directory These kinds of errors are usually the result of some missing system dev package, but googling doesn't show anything for "spmatrix". I tried installing the python-sparse package, which does provide this file, but I still get the same error. How do I fix this? 回答1: In this dev-1.3 pakage there were no ".h

k-means clustering on term-term co-occurrence matrix

Submitted by 可紊 on 2019-12-12 06:31:57

Question: I derive a term-term co-occurrence matrix K from a document-term matrix in R. I am interested in carrying out a k-means clustering analysis on the keyword-by-keyword matrix K. The dimension of K is 8962 terms x 8962 terms. I pass K to the kmeans function as follows:

```r
for (i in 1:25) {
  # Run kmeans for each level of i, allowing up to 100 iterations for convergence
  kmeans <- kmeans(x=K, centers=i, iter.max=100)
  # Combine cluster number and cost together, write to df
  cost_df <- rbind(cost_df, cbind(i
```
…
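The same elbow loop can be sketched in Python terms; `scipy.cluster.vq.kmeans2` stands in for R's `kmeans`, and the small random matrix is a placeholder for the 8962 x 8962 K (an assumption, not the asker's data).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
K = rng.random((200, 8))          # placeholder for the co-occurrence matrix

costs = []
for i in range(1, 6):
    centroids, labels = kmeans2(K, i, iter=100, minit="++", seed=0)
    # Within-cluster sum of squares: the "cost" gathered into cost_df in R
    wss = float(((K - centroids[labels]) ** 2).sum())
    costs.append((i, wss))
```

Plotting `costs` against `i` gives the usual elbow curve; the cost at k=1 is simply the total sum of squares and drops as k grows.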

Python - Efficient Function with scipy sparse Matrices

Submitted by 旧城冷巷雨未停 on 2019-12-12 05:27:01

Question: For a project, I need an efficient function in Python that solves the following task: given a very large list X of long sparse vectors (i.e. a big sparse matrix) and another matrix Y that contains a single vector y, I want a list of "distances" from y to every element of X. Here the "distance" is defined like this: compare each element of the two vectors, always take the lower one, and sum them up. Example:

```
X = [[0,0,2], [1,0,0], [3,1,0]]
Y = [[1,0,2]]
```

The function should return dist = [2,1 …
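For sparse inputs, this "take the lower element and sum" distance can be done in one vectorized step; a small sketch with SciPy, stacking y once per row so the shapes match before the element-wise `minimum`:

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix([[0, 0, 2],
                       [1, 0, 0],
                       [3, 1, 0]])
y = sparse.csr_matrix([[1, 0, 2]])

# Repeat y for each row of X, take the element-wise minimum, sum each row
Y = sparse.vstack([y] * X.shape[0])
dist = np.asarray(X.minimum(Y).sum(axis=1)).ravel()
print(dist)   # -> [2 1 1]
```

`minimum` stays sparse throughout, which matters for the "very large list of long sparse vectors" in the question.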

Create dense matrix from sparse matrix efficiently (numpy/scipy but NO sklearn)

Submitted by 北慕城南 on 2019-12-12 05:23:37

Question: I have a sparse.txt that looks like this:

```
# first column is label 0 or 1
# rest of the data is sparse data
# maximum value in the data is 4, so the future dense matrix will
# have 1+4 = 5 elements in a row
# file: sparse.txt
1 1:1 2:1 3:1
0 1:1 4:1
1 2:1 3:1 4:1
```

The required dense.txt is this:

```
# required file: dense.txt
1 1 1 1 0
0 1 0 0 1
1 0 1 1 1
```

Without using scipy's coo_matrix I did it in a simple way like this:

```python
def create_dense(fsparse, fdense, fvocab):
    # number of lines in vocab
    lvocab =
```
…
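A minimal parser for this libsvm-style layout, sketched with plain numpy; the in-memory `lines` list stands in for reading sparse.txt.

```python
import numpy as np

lines = ["1 1:1 2:1 3:1", "0 1:1 4:1", "1 2:1 3:1 4:1"]
max_index = 4                        # highest feature index in the file

dense = np.zeros((len(lines), 1 + max_index), dtype=int)
for i, line in enumerate(lines):
    parts = line.split()
    dense[i, 0] = int(parts[0])      # label goes in column 0
    for pair in parts[1:]:
        idx, val = pair.split(":")
        dense[i, int(idx)] = int(val)

print(dense)
```

Writing `dense` back out with `np.savetxt("dense.txt", dense, fmt="%d")` would produce the required file.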

How to represent an array with empty elements in JSON?

Submitted by 房东的猫 on 2019-12-12 05:12:58

Question: I have an array in JavaScript that looks like this:

```javascript
var pattern = [ ["C5", 3], , , , , , , , ["C5", 3], , , ]
```

I want to store it in a JSON file like this:

```
{ "pattern": [ ["C5", 3], , , , , , , , ["C5", 3], , , ] }
```

JSONLint tells me this:

```
Parse error on line 6:
... ], , ,
---------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '['
```

So I understand I can't leave the space between the commas empty. What is similar to empty but accepted by the JSON standard? This pattern file …
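The JSON stand-in for an empty slot is `null`; a quick Python round-trip showing that `None` serializes to `null` and survives loading (the exact slot count here is illustrative):

```python
import json

# JSON has no "empty slot"; use null, which maps to None in Python
pattern = [["C5", 3], None, None, None,
           None, None, None, None,
           ["C5", 3], None, None, None]

text = json.dumps({"pattern": pattern})
restored = json.loads(text)["pattern"]
```

On the JavaScript side the same file parses with `JSON.parse`, and the `null` entries can be skipped when playing back the pattern.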

Numpy/Scipy broadcast calculating scalar product for certain elements

Submitted by 谁都会走 on 2019-12-12 05:12:44

Question: I have a sparse matrix A and a dataframe (df) with rows telling which pairs of rows should be taken to calculate a scalar product:

```
Row1  Row2  Value
2     147   scalar product of vectors at Row1 and Row2 in matrix A
```

Can I do this in a broadcasting manner, without looping etc.? In my case A is about 1M x 100k in size and the dataframe has 10M rows.

Answer 1: Start with a small sparse matrix (csr is the best for math):

```python
In [167]: A = sparse.csr_matrix([[1, 2, 3],  # Vadim's example
                                 [2, 1, 4],
                                 [0, 2, 2],
                                 [3, 0, 3]])
In [168]: AA = A.A  # dense equivalent
In
```
…
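The loop-free pattern the truncated answer builds toward can be sketched directly: fancy-index the two row sets, multiply element-wise, and sum across columns. The index pairs below are made up for illustration.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix([[1, 2, 3],
                       [2, 1, 4],
                       [0, 2, 2],
                       [3, 0, 3]])

row1 = np.array([0, 2])      # hypothetical Row1 column of the dataframe
row2 = np.array([1, 3])      # hypothetical Row2 column

# Row-wise dot products without a Python loop: element-wise product of the
# selected row blocks, then a sum across the columns
vals = np.asarray(A[row1].multiply(A[row2]).sum(axis=1)).ravel()
```

For the 10M-pair case the index arrays would come straight from `df["Row1"].to_numpy()` and `df["Row2"].to_numpy()`, possibly processed in chunks to bound memory.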

How to use block_diag repeatedly

Submitted by 余生颓废 on 2019-12-12 04:19:42

Question: I have a rather simple question but still couldn't make it work. I want a block-diagonal n^2 x n^2 matrix. The blocks are sparse n x n matrices with just the main diagonal, the first off-diagonals, and the fourth off-diagonal. For the simple case of n=4 this can easily be done:

```python
datanew = ones((5, n))
datanew[2] = -2*datanew[2]
diagsn = [-4, -1, 0, 1, 4]
DD2 = sparse.spdiags(datanew, diagsn, n, n)
new = sparse.block_diag([DD2, DD2, DD2, DD2])
```

Since this is only useful for small n's, is there a better way to use block_diag?
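Two loop-free ways to repeat the block, sketched under the question's setup (with n=8 so the fourth off-diagonal actually exists): pass the block list-multiplied by n, or use a Kronecker product with the identity. Both give the same n^2 x n^2 matrix.

```python
import numpy as np
from scipy import sparse

n = 8
datanew = np.ones((5, n))
datanew[2] = -2 * datanew[2]
DD2 = sparse.spdiags(datanew, [-4, -1, 0, 1, 4], n, n)

# Repeat the block without writing it out n times ...
new = sparse.block_diag([DD2] * n)

# ... or equivalently as a Kronecker product with the identity
new_kron = sparse.kron(sparse.identity(n), DD2)
```

The `kron` form generalizes nicely: replacing `identity(n)` with another sparse matrix couples the blocks, which is the usual route to 2D finite-difference operators.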