Python - Quick Upscaling of Array with Numpy, No Image Library Allowed [duplicate]

Question

Note on duplicate message:

The themes are similar, but this is not exactly a duplicate, especially since the loop is still the fastest method. Thanks.

Goal:

Quickly upscale an array from [small,small] to [big,big] by a given factor, without using an image library. The scaling is very simple: each value in the small array is spread over several values in the big array and normalized by the number of values it is spread over. In astronomical terms this is "flux conserving": with a factor of 2, a value of 16 in the small array becomes four 4s in the big array, so the total amount is retained.

Problem:

I have working code for the upscaling, but it is slow compared to downscaling. Upscaling should actually be easier than downscaling (which requires many sums, in this basic case): it only requires already-known data to be put in big chunks of a preallocated array.
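
For contrast, here is a minimal sketch of the block-sum downscaling mentioned above. The function name `downscale_sum` is hypothetical (it is not part of the code in this post), and the sketch assumes both dimensions are exact multiples of the factor.

```python
import numpy as np

def downscale_sum(big, factor):
    """Sum each factor x factor block of `big` into one output value.

    Assumes big.shape is an exact multiple of `factor` in both dimensions.
    """
    rows, cols = big.shape
    # Split each axis into (block index, position inside block).
    blocks = big.reshape(rows // factor, factor, cols // factor, factor)
    # Summing over the two "inside block" axes conserves the total.
    return blocks.sum(axis=(1, 3))

big = np.ones((4, 4))
small = downscale_sum(big, 2)
print(small)
```

Every output value here needs factor*factor additions, whereas upscaling only needs one division per input value followed by copies.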

For a working example, a [2,2] array of [16,24;8,16]:

16 , 24

8 , 16

Multiplied by a factor of 2 for a [4,4] array would have the values:

4 , 4 , 6 , 6

4 , 4 , 6 , 6

2 , 2 , 4 , 4

2 , 2 , 4 , 4
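
The worked example above can be reproduced with a short sketch. It uses two `np.repeat` calls purely for illustration (not as the fastest method) and assumes an integer factor; the variable names are made up for this example.

```python
import numpy as np

small = np.array([[16.0, 24.0],
                  [8.0, 16.0]])
factor = 2  # integer upscale factor (assumed)

# Divide each value by the number of output cells it is spread over,
# then repeat it along the rows and along the columns.
big = np.repeat(np.repeat(small / factor**2, factor, axis=0), factor, axis=1)

print(big)

# "Flux conserving": the total of the array is unchanged.
assert np.isclose(big.sum(), small.sum())
```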

The fastest implementation is a for loop accelerated by numba's jit & prange. I'd like to better leverage Numpy's pre-compiled functions to get this job done. I'll also entertain Scipy stuff - but not its resizing functions.

It seems like a perfect problem for strong matrix manipulation functions, but I just haven't managed to make it happen quickly.

Additionally, the single-line numpy call is way funky, so don't be surprised. But it's what it took to get it to align correctly.

Code examples:

Check the more optimized calls below. Be warned: the case I have here makes a 20480x20480 float64 array that can take up a fair bit of memory, but it can show whether a method is too memory intensive (as matrices can be).
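
As a rough guide to how much memory that is, the size of the output array alone can be estimated from the 2048x2048 input and factor of 10 used in the test setup. This is only a sketch; intermediate copies made by a given method come on top of this number.

```python
import numpy as np

in_shape = (2048, 2048)  # input size used in the timeit setup
factor = 10              # upscale factor used in the timeit setup

out_shape = (in_shape[0] * factor, in_shape[1] * factor)
n_bytes = out_shape[0] * out_shape[1] * np.dtype(np.float64).itemsize

print(out_shape, n_bytes / 2**30, "GiB")
```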

Environment: Python 3, Windows, i5-4960K @ 4.5 GHz. Time to run for loop code is ~18.9 sec, time to run numpy code is ~52.5 sec on the shown examples.

% MAIN: To run these

import timeit

timeitSetup = ''' 
from Regridder1 import Regridder1
import numpy as np

factor = 10;

inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';

print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));

timeitSetup = ''' 
from Regridder2 import Regridder2
import numpy as np

factor = 10;

inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';

print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));

% FUN: Regridder 1 - for loop

import numpy as np
from numba import prange, jit

@jit(nogil=True)
def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.zeros(outSize); #preallocate
    outBlocks = inArray/outBlockSize; #precalc the resized blocks to go faster
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j]; #puts normalized value in a bunch of places

    return outArray;

% FUN: Regridder 2 - numpy

import numpy as np

def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels

    outArray = inArray.repeat(factor).reshape(inSize[0],factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor,inSize[1]*factor).T/outBlockSize;

    return outArray;

I would greatly appreciate insight into speeding this up. Hopefully the code is good; I wrote it in the text box.

Current best solution:

On my computer, the numba jit for-loop implementation (Regridder1), with jit applied only to the part that needs it, runs the timeit test in 18.0 sec, while the numpy-only implementation (Regridder2) runs it in 18.5 sec. The bonus is that the numpy-only implementation doesn't have to wait for jit to compile the code on the first call. Jit's cache=True lets it skip compiling on subsequent runs. The other options (nogil, nopython, prange) don't seem to help, but they don't seem to hurt either. Maybe future numba updates will do better.
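
One possible reason prange made no difference: in numba, prange only runs in parallel when the function is compiled with parallel=True, and otherwise behaves like range. Below is a minimal sketch of a variant with that flag set. The name `upscale_parallel` is hypothetical, the explicit inner loops are a choice made for this sketch, and the pure-Python fallback exists only so the example still runs where numba is not installed. It was not benchmarked for this post.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch runs without numba (slow, for illustration only).
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap
    prange = range

@njit(parallel=True)
def upscale_parallel(in_array, factor):
    rows, cols = in_array.shape
    out = np.empty((rows * factor, cols * factor), dtype=np.float64)
    scale = factor * factor
    # Only the outer loop is parallel; each i writes to its own rows of out.
    for i in prange(rows):
        for j in range(cols):
            value = in_array[i, j] / scale
            for di in range(factor):
                for dj in range(factor):
                    out[i * factor + di, j * factor + dj] = value
    return out

example = np.array([[16.0, 24.0], [8.0, 16.0]])
print(upscale_parallel(example, 2))
```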

For simplicity and portability, Regridder2 is the best option. It's nearly as fast and doesn't need numba installed (which my Anaconda install required me to install separately).

% FUN: Regridder 1 - for loop

import numpy as np

def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.empty(outSize) #preallocate
    outBlocks = inArray/outBlockSize #precalc the resized blocks to go faster
    factor = np.int64(factor) #convert to an integer to be safe (in case it's a 1.0 float)

    outArray = RegridderUpscale(inSize, factor, outArray, outBlocks) #call a function that has just the loop

    return outArray;
#END def Regridder1

from numba import jit, prange
@jit(nogil=True, nopython=True, cache=True) #nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks ):
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j];
        #END for j
    #END for i
    #scales the original data up, note for other languages you need i*factor+factor-1 because slicing
    return outArray; #return success
#END def RegridderUpscale

% FUN: Regridder 2 - numpy based on @ZisIsNotZis's answer

import numpy as np

def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    #outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]; #whoops

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels

    outArray = np.broadcast_to( inArray[:,None,:,None]/outBlockSize, (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1])); #single line call that gets the job done

    return outArray;
#END def Regridder2
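
A quick way to convince yourself the broadcast + reshape line lines things up correctly is to compare it with an independent reference. The sketch below uses np.kron as that reference (an assumption for this check, not something used elsewhere in this post); the function names are hypothetical.

```python
import numpy as np

def upscale_broadcast(in_array, factor):
    """Same idea as Regridder2: broadcast a 4-D view, then reshape to 2-D."""
    rows, cols = in_array.shape
    view = np.broadcast_to(in_array[:, None, :, None] / factor**2,
                           (rows, factor, cols, factor))
    return view.reshape(rows * factor, cols * factor)

def upscale_kron(in_array, factor):
    """Reference: the Kronecker product with a block of ones repeats each value."""
    return np.kron(in_array, np.ones((factor, factor))) / factor**2

rng = np.random.default_rng(0)
test_array = rng.random((3, 4))

assert np.allclose(upscale_broadcast(test_array, 3), upscale_kron(test_array, 3))
assert np.isclose(upscale_broadcast(test_array, 3).sum(), test_array.sum())
```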

Answer 1:

I ran some benchmarks on this using a 512x512 byte image (10x upscale):

a = np.empty((512, 512), 'B')

Repeat Twice

>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Repeat Once + Reshape

>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The two methods above both involve copying twice, while the two methods below each copy once.
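
To illustrate the "copies once" case for the view-based method, here is a small sketch on a tiny array (sizes chosen only for readability). np.broadcast_to returns a view with stride 0 along the repeated axes, so no data is copied until the final reshape.

```python
import numpy as np

a = np.arange(6, dtype=np.uint8).reshape(2, 3)
factor = 2

# 4-D read-only view: the two new axes have stride 0, so nothing is copied.
view = np.broadcast_to(a[:, None, :, None], (2, factor, 3, factor))
print(view.strides)
print(np.shares_memory(view, a))   # True

# The view cannot be flattened to 2-D without materializing it,
# so reshape performs the single copy here.
out = view.reshape(2 * factor, 3 * factor)
print(np.shares_memory(out, a))    # False
```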

Fancy Indexing

Since t can be repeatedly used (and pre-computed), it is not timed.

>>> t = np.arange(512, dtype='B').repeat(10)
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Viewing + Reshape

>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It seems that viewing + reshape wins (at least on my machine). The results on a 2048x2048 byte image are the following, where view + reshape still wins:

2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while the results for a 2048x2048 float64 image are

3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

which, though the itemsize is 8 times larger, didn't take much more time

Answer 2:

Some new functions, which show that the order of operations is important:

import numpy as np
from numba import jit

A=np.random.rand(2048,2048)

@jit
def reg1(A,factor):
    factor2=factor**2
    a,b = [factor*s for s in A.shape]
    B=np.empty((a,b),A.dtype)
    Bf=B.ravel()
    k=0
    for i in range(A.shape[0]):
        Ai=A[i]
        for _ in range(factor):
            for j in range(A.shape[1]):
                x=Ai[j]/factor2
                for _ in range(factor):
                    Bf[k]=x
                    k += 1
    return B   

def reg2(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,0),factor,1)

def reg3(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,1),factor,0)

def reg4(A,factor):
    shx,shy=A.shape
    stx,sty=A.strides
    B=np.broadcast_to((A/factor**2).reshape(shx,1,shy,1),
    shape=(shx,factor,shy,factor))
    return B.reshape(shx*factor,shy*factor)

And the timings:

In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
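
reg2 and reg3 differ only in which axis is repeated first, and they return identical arrays, so the timing gap comes purely from the order of operations. A minimal sketch to check this and time both orders on your own machine follows. The function names and the 256x256 test size are choices made for this example, and the timings will vary.

```python
import timeit
import numpy as np

def repeat_rows_first(a, factor):
    # Same order as reg2: axis 0 first, then axis 1.
    return np.repeat(np.repeat(a / factor**2, factor, 0), factor, 1)

def repeat_cols_first(a, factor):
    # Same order as reg3: axis 1 first, then axis 0.
    return np.repeat(np.repeat(a / factor**2, factor, 1), factor, 0)

a = np.random.rand(256, 256)

# Both orders give exactly the same result.
assert np.array_equal(repeat_rows_first(a, 5), repeat_cols_first(a, 5))

for fn in (repeat_rows_first, repeat_cols_first):
    seconds = timeit.timeit(lambda: fn(a, 5), number=10)
    print(fn.__name__, seconds)
```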

Source: https://stackoverflow.com/questions/53330908/python-quick-upscaling-of-array-with-numpy-no-image-libary-allowed

Tags

python

arrays

numpy

scaling