I have about 15 million pairs, each consisting of a single int paired with a batch of (2 to 100) other ints.
If it makes a difference, the ints themselves range from 0 to about 15 million.
I would do the following:
```python
import numpy as np

# create example data (100 pairs here; scale to 15 million in practice)
A = np.random.randint(0, 15000000, 100)
B = [np.random.randint(0, 15000000, k) for k in np.random.randint(2, 101, 100)]
```
`int32` is sufficient:

```python
A32 = A.astype(np.int32)
```
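As a quick sanity check (not part of the recipe itself), the dtype change halves the memory footprint on platforms where the default integer is `int64`:

```python
# rough memory check: int32 halves the footprint of the default int64
print(A.nbytes)    # 800 bytes for 100 default (int64) values on most platforms
print(A32.nbytes)  # 400 bytes for 100 int32 values
# at the full ~15 million scale this saves roughly 60 MB on A alone
```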
We want to glue all the batches together. First, write down the batch sizes so we can separate them later.
```python
from itertools import chain

# batch sizes with a leading 0, so the cumulative sum gives start offsets
sizes = np.fromiter(chain((0,), map(len, B)), np.int32, len(B) + 1)
boundaries = sizes.cumsum()
# force int32; concatenating into `out` casts the default-int batches
B_all = np.empty(boundaries[-1], np.int32)
np.concatenate(B, out=B_all)
```
After gluing, re-split:
```python
B32 = np.split(B_all, boundaries[1:-1])
```
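A quick check (illustrative, not part of the original recipe) confirms the round trip was lossless and everything is now `int32`:

```python
# every re-split batch matches its original counterpart
assert all(np.array_equal(b32, b) for b32, b in zip(B32, B))
assert all(b.dtype == np.int32 for b in B32)
```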
Finally, make an array of pairs for convenience:

```python
# recent NumPy refuses to build a ragged array implicitly, so wrap the
# variable-length batches in an explicit object array first
second = np.empty(len(B32), dtype=object)
second[:] = B32
pairs = np.rec.fromarrays([A32, second], names=["first", "second"])
```
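To illustrate the resulting layout (a hypothetical peek at one element):

```python
# each record holds one int and its batch of ints
p = pairs[3]
print(p.first)     # a single int32
print(p.second)    # an int32 array of length 2..100, a view into B_all
print(len(pairs))  # 100 in this scaled-down example
```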
What was the point of gluing and then splitting again?
First, note that the re-split arrays are all views into `B_all`, so we do not waste much memory by having both. Also, if we modify either `B_all` or `B32` (or rather, some of its elements) in place, the other one will be automatically updated as well.
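A small demonstration of that aliasing (illustrative only):

```python
# B32[0] is a view, so writing through it shows up in B_all and vice versa
assert np.shares_memory(B32[0], B_all)
old = B_all[0]
B32[0][0] = -1          # write through the view...
assert B_all[0] == -1   # ...and the flat array sees it
B_all[0] = old          # restore the original value
```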
The advantage of having `B_all` around is efficiency via numpy's `reduceat` ufunc method. If we wanted, for example, the means of all batches, we could do `np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:]` (note the `[1:]`, which drops the leading 0 so the divisor lines up with the per-batch sums), which is faster than looping over `pairs['second']`.
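Put together as a runnable sketch, with a plain Python loop only for verification:

```python
# vectorized per-batch means via reduceat: one sum per start offset
means = np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:]

# equivalent (but slower) Python-level loop for comparison
means_loop = np.array([batch.mean() for batch in pairs["second"]])
assert np.allclose(means, means_loop)
```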