gpu

arrayfun with function with inputs of different dimensions

ぃ、小莉子 submitted on 2020-01-30 06:30:12
Question: I'm trying to create a matrix that contains the averages of the k×k submatrices of a larger n×n matrix, where n is divisible by k. I can accomplish this fairly efficiently with something like this: mat = mat2cell(mat, k*ones(1,n/k), k*ones(1,n/k)) mat = cellfun(@mean,mat,'UniformOutput',false); mat = cellfun(@mean,mat,'UniformOutput',false); %repeated to collapse cells to 1x1 mat = cell2mat(mat) However, since I have a very large amount of data, all in very large matrices, repeating this …
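Not part of the original question, but the same block-averaging idea can be expressed without the cell-array round trip. Below is a minimal NumPy sketch (a Python stand-in for the asker's MATLAB, with mat, n, and k mirroring the question) that reshapes the array into k×k blocks and averages each block in one vectorized step:

```python
import numpy as np

def block_average(mat: np.ndarray, k: int) -> np.ndarray:
    """Average every k-by-k block of an n-by-n array (n divisible by k)."""
    n = mat.shape[0]
    assert mat.shape == (n, n) and n % k == 0, "expects a square array with n divisible by k"
    # Split each axis into (n//k, k) blocks, then reduce over the two within-block axes.
    return mat.reshape(n // k, k, n // k, k).mean(axis=(1, 3))

# Example: a 6x6 array averaged over 2x2 blocks gives a 3x3 result.
a = np.arange(36, dtype=float).reshape(6, 6)
print(block_average(a, 2))
```

The reshape is only a view of the original data, so the actual work is a single mean reduction; the analogous MATLAB trick uses reshape/permute/mean instead of mat2cell/cellfun.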

CUDA FFT exception

最后都变了- submitted on 2020-01-26 03:15:10
Question: I'm trying to use the CUDA FFT (cuFFT) library. The problem occurs when cufftPlan1d(..) throws an exception. #define NX 256 #define BATCH 10 cufftHandle plan; cufftComplex *data; cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH); if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); return; } if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed"); return; } When the compiler hits the …
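For comparison only (not a fix for the C++ exception above): a batched 1D complex-to-complex transform like the one being planned with cufftPlan1d can be sketched from Python via CuPy, which drives cuFFT underneath. NX and BATCH mirror the question's defines:

```python
import cupy as cp

NX, BATCH = 256, 10

# One row per batch element; cupy.fft.fft plans and runs a batched C2C cuFFT along the last axis.
data = cp.random.random((BATCH, NX)).astype(cp.complex64)
result = cp.fft.fft(data, axis=-1)

cp.cuda.Device(0).synchronize()  # make sure the transform has finished before using the result
print(result.shape, result.dtype)
```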

Sub-Matrix computations

自古美人都是妖i submitted on 2020-01-24 00:58:08
Question: I want to calculate the pairwise distance between two sub-matrices of a matrix. For example, I have a matrix A (M×N) and two blocks of that matrix, B1 (m×n) and B2 (k×t). More specifically, I want to calculate the distance of the B1(1,1) element from all the other elements of B2, and to do this process for all the B1 elements. To be clearer, B1 and B2 may not be compact parts of the matrix, and basically the information I know is the coordinates of the elements of B1 and B2 on the …
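One hedged reading of the question, sketched in NumPy: if the coordinates of the B1 and B2 elements inside A are known, gather the two value sets and let broadcasting produce every B1-to-B2 distance at once. The arrays rows_b1/cols_b1 and rows_b2/cols_b2 are hypothetical stand-ins for the known coordinates, and "distance" is taken here as the absolute difference of element values:

```python
import numpy as np

# Hypothetical setup: A is the full matrix; the row/col arrays are the known coordinates
# of the B1 and B2 elements inside A (they need not form compact blocks).
A = np.random.rand(8, 10)
rows_b1, cols_b1 = np.array([0, 0, 3]), np.array([1, 4, 2])
rows_b2, cols_b2 = np.array([5, 6]), np.array([7, 7])

b1_vals = A[rows_b1, cols_b1]          # shape (len(B1),)
b2_vals = A[rows_b2, cols_b2]          # shape (len(B2),)

# Broadcasting gives a len(B1) x len(B2) matrix: entry (i, j) is |B1_i - B2_j|.
dist = np.abs(b1_vals[:, None] - b2_vals[None, :])
print(dist)
```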

Time measuring in PyOpenCL

ぐ巨炮叔叔 submitted on 2020-01-23 16:46:26
Question: I am running a kernel using PyOpenCL on an FPGA and on a GPU. In order to measure the time it takes to execute, I use: t1 = time() event = mykernel(queue, (c_width, c_height), (block_size, block_size), d_c_buf, d_a_buf, d_b_buf, a_width, b_width) event.wait() t2 = time() compute_time = t2-t1 compute_time_e = (event.profile.end-event.profile.start)*1e-9 This gives me the execution time from the point of view of the host (compute_time) and of the device (compute_time_e). The problem is that …
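For context, a self-contained PyOpenCL timing sketch (with a trivial kernel named scale standing in for the asker's mykernel): the queue must be created with profiling enabled, and the host wall-clock time includes enqueue and driver overhead, while event.profile measures only the kernel execution on the device:

```python
import numpy as np
import pyopencl as cl
from time import time

ctx = cl.create_some_context()
# Profiling must be enabled on the queue, otherwise event.profile.* is unavailable.
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

src = "__kernel void scale(__global float *x) { int i = get_global_id(0); x[i] *= 2.0f; }"
prg = cl.Program(ctx, src).build()

a = np.arange(1 << 20, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

t1 = time()
event = prg.scale(queue, a.shape, None, buf)
event.wait()
t2 = time()

host_time = t2 - t1                                              # includes enqueue/driver overhead
device_time = (event.profile.end - event.profile.start) * 1e-9   # pure kernel execution, in seconds
print(host_time, device_time)
```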

How to reuse tensorflow after cudaDeviceReset() in C++?

旧城冷巷雨未停 submitted on 2020-01-23 03:36:12
Question: I am working on a large CUDA app in C++ that runs various models and needs to completely release all GPU memory, or the other operations will fail. I am able to release all the memory after closing all TF sessions and running cudaDeviceReset(). But afterwards I cannot run any new TensorFlow code, and session creation returns nullptrs. I tried cudaDeviceSynchronize() before and after, thinking that would help, but no luck. I figured the call to InitMain would re-initialize TensorFlow, but it …
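The question is about the C++ API, but a common workaround for "release everything, then run more TensorFlow" is process isolation: run each model in its own process so the CUDA context and all GPU memory disappear when that process exits, with no cudaDeviceReset() needed. A hedged Python sketch of that pattern (not the asker's C++ setup):

```python
import multiprocessing as mp

def run_model(config):
    # Import TensorFlow inside the child so the CUDA context lives and dies with this process.
    import tensorflow as tf
    x = tf.random.normal([1024, 1024])
    y = tf.matmul(x, x)
    return float(tf.reduce_sum(y))

if __name__ == "__main__":
    # 'spawn' guarantees a fresh interpreter (and a fresh CUDA context) per model run.
    ctx = mp.get_context("spawn")
    for cfg in ["model_a", "model_b"]:
        with ctx.Pool(1) as pool:
            result = pool.apply(run_model, (cfg,))
        # When the pool exits, the child process dies and its GPU memory is fully released.
        print(cfg, result)
```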

How to make TensorFlow use 100% of GPU?

Deadly submitted on 2020-01-21 19:17:24
Question: I have a laptop with an RTX 2060 GPU and I am using Keras and TF 2 to train an LSTM on it. I am also monitoring GPU use with nvidia-smi, and I noticed that the Jupyter notebook and TF use at most 35%, and usually the GPU is used between 10-25%. Under these conditions it took more than 7 hours to train this model. I want to know whether I am doing something wrong or whether it is a limitation of Keras and TF. My nvidia-smi output: Sun Nov 3 00:07:37 2019 …
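Low GPU utilization with a Keras LSTM is usually an input-pipeline or kernel-selection issue rather than a hard limit of Keras/TF. A sketch of the usual knobs, with toy data and illustrative values rather than the asker's model: keep the LSTM in its default (cuDNN-compatible) configuration, use a larger batch size, and feed data through a prefetched tf.data pipeline:

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for the asker's dataset: 10000 sequences of length 50 with 8 features.
x = np.random.rand(10000, 50, 8).astype("float32")
y = np.random.rand(10000, 1).astype("float32")

# Larger batches and prefetching keep the GPU fed instead of waiting on the host.
ds = (tf.data.Dataset.from_tensor_slices((x, y))
      .shuffle(10000)
      .batch(512)
      .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.Sequential([
    # Default activation/recurrent_activation lets TF pick the fast cuDNN LSTM kernel.
    tf.keras.layers.LSTM(128, input_shape=(50, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(ds, epochs=2)
```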

How can I check/release GPU-memory in tensorflow 2.0b?

守給你的承諾、 submitted on 2020-01-21 05:34:05
Question: In my tensorflow2.0b program I get an error like this: ResourceExhaustedError: OOM when allocating tensor with shape[727272703] and type int8 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TopKV2] The error occurs after a number of GPU-based operations within this program have executed successfully. I would like to release all the GPU memory associated with these past operations in order to avoid the above error. How can I do this in tensorflow-2.0b? How could I check …
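TensorFlow does not expose a supported call in 2.0b to free memory held by earlier ops from within the same process, so the sketch below shows two hedged partial answers: request on-demand allocation with set_memory_growth, and query current/peak usage via tf.config.experimental.get_memory_info, which was added in later 2.x releases and is not available in 2.0b. Fully releasing the memory generally still requires ending the process:

```python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory on demand instead of grabbing it all at startup.
    tf.config.experimental.set_memory_growth(gpu, True)

# ... run GPU-based operations here ...

# Querying current/peak usage; this API exists in newer TF 2.x releases (not in 2.0b).
try:
    info = tf.config.experimental.get_memory_info("GPU:0")
    print("current:", info["current"], "peak:", info["peak"])
except AttributeError:
    print("get_memory_info not available in this TF version")
```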

the Difference between running time and time of obtaining results in CUDA

淺唱寂寞╮ submitted on 2020-01-17 14:05:05
Question: I am trying to implement my algorithm on the GPU using CUDA. The program works well, but there is a problem: when I try to print out the results, they are shown too late. Here is some of my code. (Assume that the correctness of the results does not matter.) __device__ unsigned char dev_state[128]; __device__ unsigned char GMul(unsigned char a, unsigned char b) { // Galois Field (256) Multiplication of two Bytes unsigned char p = 0; int counter; unsigned char hi_bit_set; for (counter = 0; counter < 8; counter++) { if ( …
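The "results show up too late" symptom is typically just asynchrony: kernel launches return immediately, and device output only becomes visible to the host after a synchronization point. A CuPy sketch (Python standing in for the asker's CUDA C) that contrasts naive host-side timing with CUDA-event timing around a GPU operation:

```python
import time
import cupy as cp

x = cp.random.rand(4096, 4096, dtype=cp.float32)

# Naive host timing: the launch returns immediately, so this mostly measures enqueue time.
t0 = time.time()
y = x @ x
t1 = time.time()

# Device events bracket the actual kernel execution on the GPU.
start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()
y = x @ x
stop.record()
stop.synchronize()  # results (and any device-side printf) are only guaranteed visible after sync

print("host-side (async) seconds:", t1 - t0)
print("device event milliseconds:", cp.cuda.get_elapsed_time(start, stop))
```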
