Huge sparse dataframe to scipy sparse matrix without dense transform

这一生的挚爱 submitted on 2019-12-04 09:40:06

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

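import pandas as pd
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
# Legacy pandas (< 1.0) API: to_coo() returns the COO matrix plus the row and column labels
one_hot_coo, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()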

Instead of a DataFrame (rows / columns), this method takes a Series whose rows and columns are the levels of a MultiIndex, which is why you need .stack(). That Series must be a SparseSeries, and .stack() returns a regular Series even when its input is a SparseDataFrame, so you need to call .to_sparse() before .to_coo().

The Series returned by .stack() contains only the non-null elements even though it is not a SparseSeries, so it shouldn't take more memory than the sparse version (at least when the missing value is np.nan and the dtype is np.float).
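For readers on a newer pandas: SparseSeries, SparseDataFrame and .to_sparse() were removed in pandas 1.0, so the snippet above only runs on older versions. Below is a minimal sketch of the same idea using the .sparse accessor instead, assuming pandas >= 1.0 with scipy installed. The toy df is hypothetical and only stands in for the question's data, and every column must be sparse for the accessor to be available.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's `df`.
df = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "type": ["a", "b", "b", "a"],
})

# sparse=True stores every dummy column as a SparseArray.
# An integer dtype gives a fill value of 0, which to_coo() requires.
one_hot_df = pd.get_dummies(
    df, columns=["user_id", "type"], sparse=True, dtype=np.int8
)

# DataFrame.sparse.to_coo() builds a scipy.sparse COO matrix from the
# stored (non-zero) entries only, without a dense intermediate.
coo = one_hot_df.sparse.to_coo()

# CSR is usually more convenient for row slicing and arithmetic.
csr = coo.tocsr()

print(coo.shape, coo.nnz)
```

Unlike the older Series-based .to_coo(), this call returns only the matrix; the row and column labels stay available as one_hot_df.index and one_hot_df.columns.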

  1. http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse

hpaulj

Does my answer from a few months back help?

Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory

It was accepted but I didn't get any further feedback.

I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.
