Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch (2 to 100) of ints

Asked by 逝去的感伤 on 2021-01-29 03:37

I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.

If it makes a difference, the ints themselves range from 0 to 1

3 Answers
  •  花落未央
    2021-01-29 04:08

    I would do the following:

    import numpy as np

    # create example data
    A = np.random.randint(0, 15000000, 100)
    B = [np.random.randint(0, 15000000, k) for k in np.random.randint(2, 101, 100)]
    

    int32 is sufficient for this value range:

    A32 = A.astype(np.int32)
    
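    A quick, hedged check of the saving (assuming the default randint dtype is int64, as on most 64-bit Linux/macOS builds; on Windows it may already be int32):

    # compare the memory footprints of the int64 and int32 versions
    print(A.nbytes, A32.nbytes)   # e.g. 800 vs. 400 bytes for the 100-element example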

    We want to glue all the batches together. First, write down the batch sizes so we can separate them later.

    from itertools import chain
    
    sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
    boundaries = sizes.cumsum()
    
    # force int32
    B_all = np.empty(boundaries[-1],np.int32)
    np.concatenate(B,out=B_all)
    
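    As a quick sanity check (using the names defined above), batch i now occupies the slice B_all[boundaries[i]:boundaries[i+1]]:

    # spot check: the i-th batch is a contiguous slice of B_all (i = 3 is arbitrary)
    i = 3
    assert np.array_equal(B_all[boundaries[i]:boundaries[i+1]], B[i])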

    After gluing, re-split:

    B32 = np.split(B_all, boundaries[1:-1])
    
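    A quick check (same names as above) that the re-split reproduces the original batches:

    # every piece of the split should equal the corresponding original batch
    assert all(np.array_equal(b32, b) for b32, b in zip(B32, B))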

    Finally, make an array of pairs for convenience:

    pairs = np.rec.fromarrays([A32,B32],names=["first","second"])
    
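    Note that recent NumPy versions refuse to build an array implicitly from sequences of unequal length, so the line above may raise an error there. A sketch of a workaround, keeping the same names, is to make the ragged column an explicit object array first:

    # build the ragged "second" column as an explicit 1-D object array
    second = np.empty(len(B32), dtype=object)
    second[:] = B32
    pairs = np.rec.fromarrays([A32, second], names=["first", "second"])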

    What was the point of gluing and then splitting again?

    First, note that the re-split arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather, some of its elements) in place, the other one is automatically updated as well.
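
    A small demonstration of the shared memory (names as above):

    # B32[0] is a view into B_all, so an in-place write shows up in both
    old = B_all[0]
    B_all[0] = -1
    assert B32[0][0] == -1
    B_all[0] = old   # restore the original value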

    The advantage of having B_all around is efficiency via NumPy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:], which is faster than looping over pairs['second'].
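
    A short sketch of that computation, checked against the straightforward loop (all names as defined above):

    # per-batch sums via reduceat, divided by the batch lengths
    batch_means = np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:]

    # equivalent, but slower, per-batch loop for comparison
    loop_means = np.array([b.mean() for b in B32])
    assert np.allclose(batch_means, loop_means)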
