Pandas sparse DataFrame to sparse matrix, without generating a dense matrix in memory

北荒 2020-11-30 08:48

Is there a way to convert from a pandas.SparseDataFrame to scipy.sparse.csr_matrix, without generating a dense matrix in memory?

6 Answers
  •  眼角桃花
    2020-11-30 09:48

    The answer by @Marigold works, but it is slow because it accesses every element in each column, including the zeros. Building on it, I wrote the following quick-and-dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. It also handles dense columns appropriately.

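    # Targets pandas versions before 1.0, which removed SparseSeries and SparseDataFrame.
    import numpy as np
    import pandas as pd
    from pandas._libs.sparse import BlockIndex  # private module; path may differ by version
    from scipy.sparse import coo_matrix

    def sparse_df_to_array(df):
        """Build a CSR matrix column by column without densifying sparse columns."""
        num_rows = df.shape[0]

        data = []
        row = []
        col = []

        for i, col_name in enumerate(df.columns):
            series = df[col_name]
            if isinstance(series, pd.SparseSeries):
                # Sparse column: use only the stored values and their row positions.
                column_index = series.sp_index
                if isinstance(column_index, BlockIndex):
                    column_index = column_index.to_int_index()

                ix = column_index.indices
                data.append(series.sp_values)
                row.append(ix)
                col.append(np.full(len(ix), i))
            else:
                # Dense column: every row contributes an entry.
                data.append(series.values)
                row.append(np.arange(num_rows))
                col.append(np.full(num_rows, i))

        data_f = np.concatenate(data)
        row_f = np.concatenate(row)
        col_f = np.concatenate(col)

        arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
        return arr.tocsr()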
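
SparseDataFrame and SparseSeries were removed in pandas 1.0, so the function above will not run on current releases. As a minimal sketch, assuming pandas 0.25 or later and a frame in which every column has a sparse dtype with fill value 0, the `.sparse` accessor can do the conversion. The sample data and variable names below are illustrative only and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data: a small matrix that is mostly zeros.
dense = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 2.0],
                  [3.0, 0.0, 0.0]])

# Assumption: all columns are sparse with fill value 0, so zeros are not stored.
df = pd.DataFrame(dense, columns=list("abc")).astype(pd.SparseDtype("float64", 0.0))

# The accessor returns a scipy COO matrix; tocsr() converts it to CSR.
mat = df.sparse.to_coo().tocsr()
print(mat.shape, mat.format)
```

As far as I know, the accessor expects every column to be sparse, so a frame with dense columns would need those columns cast to a sparse dtype first.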
    
