Faster alternatives to Pandas pivot_table


隐瞒了意图╮ 2021-02-11 09:33

I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I am trying to speed up the process. Currently it

3 answers

后悔当初 (original poster)

2021-02-11 10:15
You can use sparse matrices. They are quick to build, though somewhat restricted. For example, you can't index into a coo_matrix.

I recently needed to train a recommender system (LightFM), and it accepted sparse matrices as input, which made my job a lot easier. See it in action:
```
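import numpy as np
from scipy import sparse

# Row indices, column indices, and the value stored at each (row, col) pair
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
mat = sparse.coo_matrix((data, (row, col)), shape=(4, 4))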
```
```
>>> print(mat)
  (0, 0)    4
  (3, 3)    5
  (1, 1)    7
  (0, 2)    9
>>> print(mat.toarray())
[[4 0 9 0]
 [0 7 0 0]
 [0 0 0 0]
 [0 0 0 5]]
```
As you can see, the constructor places each value at its (row, col) position and fills the remaining cells with zeros, which is effectively a pivot table built from the row and column indices of your data. You can also convert the sparse matrix to a dense array or to a DataFrame (df = pd.DataFrame.sparse.from_spmatrix(mat, index=..., columns=...)).
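The example above starts from integer row and column positions, while a real DataFrame usually holds labels. Here is a minimal sketch of how the same idea might be applied to long-format data, assuming hypothetical column names user, item and value (they are not from the question) and using pd.factorize to turn labels into integer codes:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical long-format data: one value per (user, item) pair.
df = pd.DataFrame({
    "user": ["a", "b", "a", "c"],
    "item": ["x", "y", "z", "z"],
    "value": [4, 7, 9, 5],
})

# factorize maps each label to an integer code (in order of first appearance)
# and also returns the unique labels, which become the pivot's index/columns.
row_codes, row_labels = pd.factorize(df["user"])
col_codes, col_labels = pd.factorize(df["item"])

# Build the sparse "pivot": each value is placed at (row code, column code).
mat = sparse.coo_matrix(
    (df["value"].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)

# Optional conversions: labelled sparse DataFrame and plain dense array.
pivot = pd.DataFrame.sparse.from_spmatrix(mat, index=row_labels, columns=col_labels)
dense = mat.toarray()

# COO itself is not meant for element access; convert to CSR to index.
value_a_z = mat.tocsr()[0, 2]
```

Note that duplicate (row, col) pairs are summed when the COO matrix is converted, so this behaves like pivot_table with aggfunc="sum" and fill_value=0. Other aggregations such as the mean would need extra work, and whether this is actually faster on 10 million rows should be measured on your own data.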