hstack csr matrix with pandas array

问题

I am doing an exercise on Amazon Reviews, Below is the code. Basically I am not able to add column (pandas array) to CSR Matrix which i got after applying BoW. Even though the number of rows in both matrices matches i am not able to get through.

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.manifold import TSNE

#Create Connection to sqlite3
con = sqlite3.connect('C:/Users/609316120/Desktop/Python/Amazon_Review_Exercise/database/database.sqlite')

filtered_data = pd.read_sql_query("""select * from Reviews where Score != 3""", con)
def partition(x):
    if x < 3:
       return 'negative'
    return 'positive'

actualScore = filtered_data['Score']
actualScore.head()
positiveNegative = actualScore.map(partition)
positiveNegative.head(10)
filtered_data['Score'] = positiveNegative
filtered_data.head(1)
filtered_data.shape

display = pd.read_sql_query("""select * from Reviews where Score !=3 and Userid="AR5J8UI46CURR" ORDER BY PRODUCTID""", con)

sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)

final.shape

display = pd.read_sql_query(""" select * from reviews where score != 3 and id=44737 or id = 64422 order by productid""", con)

final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

final['Score'].value_counts()

count_vect = CountVectorizer()

final_counts = count_vect.fit_transform(final['Text'].values)

final_counts.shape

type(final_counts)

positive_negative = final['Score']

#Below is giving error
final_counts = hstack((final_counts,positive_negative))

回答1:

sparse.hstack combines the coo format matrices of the inputs into a new coo format matrix.

final_counts is a csr matrix, so the sparse.coo_matrix(final_counts) conversion is trivial.

positive_negative is a column of a DataFrame. Look at

 sparse.coo_matrix(positive_negative)

It probably is a (1,n) sparse matrix. But to combine it with final_counts it needs to be (1,n) shaped.

Try creating the sparse matrix, and transposing it:

sparse.hstack((final_counts, sparse.coo_matrix(positive_negative).T))

回答2:

Used Below but still getting error

merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(positive_negative).T))

Below is the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, sparse.coo_matrix(positive_
negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sparse' is not defined
>>> merged_data = scipy.sparse.hstack((final_counts, scipy.sparse.coo_matrix(pos
itive_negative).T))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 464, in h
stack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 600, in b
mat
    dtype = upcast(*all_dtypes) if all_dtypes else None
  File "C:\Python34\lib\site-packages\scipy\sparse\sputils.py", line 52, in upca
st
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))

回答3:

Even I was facing the same issue with sparse matrices. you can convert the CSR matrix to dense by todense() and then you can use np.hstack((dataframe.values,converted_dense_matrix)). It will work fine. you can't deal with sparse matrices by using numpy.hstack
However for very large data set converting to dense matrix is not a good idea. In your case scipy hstack won't work because the data types are different in hstack(int,object). Try positive_negative = final['Score'].values and scipy.sparse.hstack it. if it doesn't work can you give me the output of your positive_negative.dtype

来源：https://stackoverflow.com/questions/51700979/hstack-csr-matrix-with-pandas-array

标签

pandas

numpy

scipy

sparse-matrix