group by on scipy sparse matrix

Submitted by 依然范特西╮ on 2019-12-12 02:45:10

Question


I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 which contains keys corresponding to the 10e6 rows of my sparse matrix. I want to group my sparse matrix following these keys and aggregate with a sum function.

Example:

Keys:
['foo','bar','foo','baz','baz','bar']

Sparse matrix:
(0,1) 3              -> corresponds to the first 'foo' key
(0,10) 4             -> corresponds to the first 'bar' key
(2,1) 1              -> corresponds to the second 'foo' key
(1,3) 2              -> corresponds to the first 'baz' key
(2,3) 10             -> corresponds to the second 'baz' key
(2,4) 1              -> corresponds to the second 'bar' key

Expected result:
{
    'foo': {1: 4},               -> 4 = 3 + 1
    'bar': {4: 1, 10: 4},        
    'baz': {3: 12}               -> 12 = 2 + 10
}

What is the most efficient way to do this?

I already tried to use pandas.SparseSeries.from_coo on my sparse matrix in order to be able to use pandas group by but I get this known bug:

site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    863         for obj in objs:
    864             if not isinstance(obj, NDFrame):
--> 865                 raise TypeError("cannot concatenate a non-NDFrame object")
    866 
    867             # consolidate

TypeError: cannot concatenate a non-NDFrame object

Answer 1:


I can generate your target with basic dictionary and list operations:

keys = ['foo','bar','foo','baz','baz','bar']
rows = [0,0,2,1,2,2]; cols=[1,10,1,3,3,4]; data=[3,4,1,2,10,1]
dd = {}
for i,k in enumerate(keys):
    d1 = dd.get(k, {})          # per-key dict mapping column -> running sum
    v = d1.get(cols[i], 0)
    d1[cols[i]] = v + data[i]   # add this entry to its column total
    dd[k] = d1
print(dd)

producing (on Python 3.7+, where dicts keep insertion order)

{'foo': {1: 4}, 'bar': {10: 4, 4: 1}, 'baz': {3: 12}}

I can generate a sparse matrix from this data as well with:

from scipy import sparse
M = sparse.coo_matrix((data,(rows,cols)))
print(M)       # coo format: entries in the order they were entered
print()
Md = M.todok()
print(Md)      # dok format: same entries, different order

But notice that the order of terms is not fixed. In coo format the entries are stored in the order they were entered, but converting to another format (dok here) changes that order. In other words, the match between the keys and the elements of the sparse matrix is unspecified.

  (0, 1)    3
  (0, 10)   4
  (2, 1)    1
  (1, 3)    2
  (2, 3)    10
  (2, 4)    1

  (0, 1)    3
  (1, 3)    2
  (2, 1)    1
  (2, 3)    10
  (0, 10)   4
  (2, 4)    1
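
If the keys really are meant to line up with the stored entries one by one, a possible sketch is to read the coo matrix's row, col and data arrays directly, since those keep the order in which the entries were supplied. The helper name group_entries is hypothetical, and the one-key-per-stored-entry alignment is an assumption that the question does not confirm.

```python
from collections import defaultdict
from scipy import sparse


def group_entries(keys, M):
    """Sum stored values per (key, column).

    Assumption: keys[i] labels the i-th stored entry of the coo matrix M.
    """
    if M.format != "coo" or len(keys) != M.nnz:
        raise ValueError("expected a coo matrix with one key per stored entry")
    out = defaultdict(dict)
    # .tolist() turns NumPy scalars into plain Python ints
    for k, c, v in zip(keys, M.col.tolist(), M.data.tolist()):
        out[k][c] = out[k].get(c, 0) + v
    return dict(out)


keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
M = sparse.coo_matrix((data, (rows, cols)))
result = group_entries(keys, M)
print(result)
```

This only works while the matrix stays in coo format; once it has been converted to another format, the entry order (and with it the alignment to the keys) can no longer be relied on.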

Until you clear up this mapping, the initial dictionary approach is best.
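
If it turns out instead that each key labels one row of the matrix, as the first paragraph of the question describes, a different technique may scale better than a Python loop: build a sparse indicator matrix with one row per distinct key and multiply it by the data matrix, so that rows sharing a key are summed. This is not the approach used in the answer above, only a minimal sketch under that assumption. The helper name group_rows_sum and the three per-row keys in the example are hypothetical.

```python
import numpy as np
from scipy import sparse


def group_rows_sum(keys, M):
    """Sum the rows of M that share a key.

    Assumption: keys[i] labels row i of M.
    Returns (unique_keys, grouped), where grouped[g] is the sum of all
    rows whose key equals unique_keys[g].
    """
    M = sparse.csr_matrix(M)
    n_rows = M.shape[0]
    if len(keys) != n_rows:
        raise ValueError("expected one key per matrix row")
    uniq, inverse = np.unique(np.asarray(keys), return_inverse=True)
    # G[g, i] = 1 when row i belongs to group g
    G = sparse.csr_matrix(
        (np.ones(n_rows, dtype=M.dtype), (inverse, np.arange(n_rows))),
        shape=(len(uniq), n_rows),
    )
    return uniq, G @ M


rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
M = sparse.coo_matrix((data, (rows, cols)))
row_keys = ['foo', 'bar', 'foo']   # hypothetical: one key per row
uniq, grouped = group_rows_sum(row_keys, M)
dense = grouped.toarray()
print(uniq)
print(grouped)
```

The result stays sparse, and the rows of grouped follow the sorted order of the distinct keys returned by np.unique.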



Source: https://stackoverflow.com/questions/35410839/group-by-on-scipy-sparse-matrix
