numba

Why is this numba code 6x slower than numpy code?

不打扰是莪最后的温柔 submitted on 2019-12-20 18:01:55
Question: Is there any reason why the following code runs in 2 s:

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

while the following numba code runs in 12 s?

@jit(nopython=True)
def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

My x1 is a matrix of dimension (1, 512) and x2 is a matrix of dimension

Python: rewrite a looping numpy math function to run on GPU

折月煮酒 submitted on 2019-12-20 09:39:27
Question: Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I have spent a few good days trying to get my head around it, but to no result. I wonder if somebody can help me rewrite this function in whatever way you see fit, as long as it gives the same result at the end. I tried to use @jit from numba, but for some reason it is actually much slower than running the code as usual. With a huge sample size, the goal is to decrease the execution time

The Anaconda prompt freezes when I run code with numba's “jit” decorator

拥有回忆 submitted on 2019-12-20 04:16:02
Question: I have this Python code that should run just fine. I'm running it in Anaconda's Spyder IPython console, or in the Anaconda terminal itself, because that is the only way I can use the "numba" library and its "jit" decorator. However, either one always "freezes" or "hangs" just about whenever I run it. There is nothing wrong with the code itself, or else I'd get an error. Sometimes the code runs all the way through perfectly fine, sometimes it just prints the first line from the first function

CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE in Python

荒凉一梦 submitted on 2019-12-20 03:11:00
Question: I'm getting this error when trying to run this code in Python using CUDA. I'm following this tutorial, but I'm trying it on a Windows 7 x64 machine: https://www.youtube.com/watch?v=jKV1m8APttU In fact, I ran check_cuda() and all tests passed. Can anyone tell me what the exact issue is here? My code:

import numpy as np
from timeit import default_timer as timer
from numbapro import vectorize, cuda

@vectorize(['float64(float64, float64)'], target='gpu')
def VectorAdd(a, b):
    return a + b

def main():

Performance nested loop in numba

孤街醉人 submitted on 2019-12-20 02:55:06
Question: For performance reasons, I have started to use Numba besides NumPy. My Numba algorithm is working, but I have the feeling that it should be faster. There is one point which is slowing it down. Here is the code snippet:

@nb.njit
def rfunc1(ws, a, l):
    gn = a**l
    for x1 in range(gn):
        for x2 in range(gn):
            for x3 in range(gn):
                y = 0.0
                for i in range(1, l):
                    if numpy.all(ws[x1][0:i] == ws[x2][0:i]) and numpy.all(ws[x1][i:l] == ws[x3][i:l]):
                        y += 1
                    if numpy.all(ws[x1][0:i] == ws[x2][0:i]) and numpy

Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array

跟風遠走 submitted on 2019-12-19 21:57:09
Question: I have a (large) length-N array of k distinct functions, and a length-N array of abscissae. I want to evaluate the functions at the abscissae to return a length-N array of ordinates, and critically, I need to do it very fast. I have tried the following loop over a call to np.where, which is too slow. Create some fake data to illustrate the problem:

def trivial_functional(i):
    return lambda x: i*x

k = 250
func_table = [trivial_functional(j) for j in range(k)]
func_table = np.array(func_table) #
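One standard trick, assuming each abscissa carries an integer index saying which of the k functions applies to it, is to invert the loop: iterate over the distinct functions and evaluate each one vectorized over all the points it owns, rather than dispatching per point. A sketch built on the question's trivial_functional setup (the func_indices array and function names are my assumptions):

```python
import numpy as np

def trivial_functional(i):
    return lambda x: i * x

k = 250
func_table = np.array([trivial_functional(j) for j in range(k)])

def evaluate_grouped(func_table, func_indices, abscissae):
    # One vectorized call per distinct function present, instead of
    # one Python-level lookup per point: O(k) dispatches, not O(N).
    out = np.empty_like(abscissae, dtype=np.float64)
    for j in np.unique(func_indices):
        mask = func_indices == j
        out[mask] = func_table[j](abscissae[mask])
    return out
```

This helps most when N is much larger than k, so each function gets a big vectorized batch.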

Use numba to speed up for loop

↘锁芯ラ submitted on 2019-12-18 12:23:24
Question: From what I've read, numba can significantly speed up a Python program. Could my program's time efficiency be increased using numba?

import numpy as np

def f_big(A, k, std_A, std_k, mean_A=10, mean_k=0.2, hh=100):
    return ( 1 / (std_A * std_k * 2 * np.pi) ) * A * (hh/50) ** k * np.exp( -1*(k - mean_k)**2 / (2 * std_k **2 ) - (A - mean_A)**2 / (2 * std_A**2))

outer_sum = 0
dk = 0.000001
for k in np.arange(dk, 0.4, dk):
    inner_sum = 0
    for A in np.arange(dk, 20, dk):
        inner_sum += dk * f_big(A, k,

Python/Numba: Unknown attribute error with scipy.special.gammainc()

寵の児 submitted on 2019-12-18 06:48:42
Question: I am getting an error when running code that uses the @jit decorator. It appears that some information for the function scipy.special.gammainc() can't be located:

Failed at nopython (nopython frontend)
Unknown attribute 'gammainc' for Module(<module 'scipy.special' from 'C:\home\Miniconda\lib\site-packages\scipy\special\__init__.pyc'>)
$164.2
$164.3 = getattr(attr=gammainc, value=$164.2)

Without the @jit decorator the code runs fine. Maybe there is something required to make the attributes of

Can I perform dynamic cumsum of rows in pandas?

久未见 submitted on 2019-12-17 16:39:35
Question: If I have the following dataframe, derived like so:

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

   0
0  0
1  2
2  8
3  1
4  0
5  0
6  7
7  0
8  2
9  2

Is there an efficient way to cumsum rows with a limit and, each time this limit is reached, to start a new cumsum? After each limit is reached (however many rows), a row is created with the total cumsum. Below I have created an example of a function that does this, but it's very slow, especially when the dataframe becomes very large. I don't like

Is it safe to implement cuda gridsync() in Numba like this

断了今生、忘了曾经 submitted on 2019-12-16 18:06:19
Question: Numba lacks the CUDA C call gridsync(), so there is no canned method for syncing across an entire grid; only block-level syncs are available. If cudaKernel1 has a very fast execution time, then the following code would run 1000x faster by putting the loop into the same kernel, to avoid the GPU kernel setup time:

for i in range(10000):
    X = X + cudaKernel1[(100,100),(32,32)](X)

But you can't, because you need all of the grid to finish before the next iteration can start, and there is no