nvidia

Copying 2D arrays to GPU of known variable width

Submitted by 纵然是瞬间 on 2019-12-02 08:08:43
Question: I am looking into how to copy a 2D array with a different width for each row into the GPU.

    int rows = 1000;
    int cols;
    int **host_matrix = malloc(sizeof(int*) * rows);
    int *d_array;
    int *length;
    ...

Each host_matrix[i] might have a different length, which I keep in length[i], and that is where the problem starts. I would like to avoid copying dummy padding data. Is there a better way of doing it? According to this thread, the following is not a clever way of doing it:

    cudaMalloc(d_array, rows * sizeof(int*));
    for (int ...
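A common way to avoid both padding and one cudaMemcpy per row is to flatten the ragged rows into a single contiguous buffer plus a prefix-sum offsets array, so a single copy moves everything. A minimal sketch, assuming host_matrix and length are already filled as in the question; names such as offsets, h_data, d_data and d_offsets are illustrative, not from the original post:

    // Build prefix-sum offsets: row i occupies [offsets[i], offsets[i+1]) in the flat buffer.
    int *offsets = (int *) malloc((rows + 1) * sizeof(int));
    offsets[0] = 0;
    for (int i = 0; i < rows; ++i)
        offsets[i + 1] = offsets[i] + length[i];
    int total = offsets[rows];

    // Pack the ragged rows into one contiguous host buffer.
    int *h_data = (int *) malloc(total * sizeof(int));
    for (int i = 0; i < rows; ++i)
        memcpy(h_data + offsets[i], host_matrix[i], length[i] * sizeof(int));

    // One allocation and one copy for the data, one more for the offsets.
    int *d_data, *d_offsets;
    cudaMalloc(&d_data, total * sizeof(int));
    cudaMalloc(&d_offsets, (rows + 1) * sizeof(int));
    cudaMemcpy(d_data, h_data, total * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, offsets, (rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    // In a kernel, row i then spans d_data[d_offsets[i]] .. d_data[d_offsets[i+1] - 1].

The snippet assumes <stdlib.h>, <string.h> and <cuda_runtime.h> are included; error checking is omitted for brevity.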

how to find the active SMs?

Submitted by 一曲冷凌霜 on 2019-12-02 07:23:23
Is there any way to find out the number of free/active SMs? Or at least to read the voltage, power, or temperature of each SM, so I can tell whether it is working, in real time while a job is executing on the GPU device? %smid helped me read the ID of each SM; something similar would be helpful. Thanks and regards, Rakesh

The CUDA Profiling Tools Interface (CUPTI) contains an Events API that enables run-time sampling of GPU PM counters. The CUPTI SDK ships as part of the CUDA Toolkit. Documentation on sampling can be found in the section CUPTI Events ...
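As an aside on the %smid approach the question mentions: a small kernel can record which SM each block runs on by reading the %smid special register through inline PTX. A minimal sketch; the kernel and buffer names are illustrative:

    __global__ void record_smid(unsigned int *sm_ids)
    {
        unsigned int smid;
        // %smid is a read-only PTX special register giving the SM the thread is running on.
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        if (threadIdx.x == 0)
            sm_ids[blockIdx.x] = smid;   // one entry per block
    }

Launched as, say, record_smid<<<numBlocks, 128>>>(d_sm_ids) while another job is active, and copied back to the host, this shows which SMs picked up the blocks, though it does not expose power or temperature readings.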

SYCL exception caught: Error: [ComputeCpp:RT0101] Failed to create kernel ((Kernel Name: SYCL_class_multiply))

Submitted by 喜欢而已 on 2019-12-02 06:41:07
I cloned https://github.com/codeplaysoftware/computecpp-sdk.git and modified the computecpp-sdk/samples/accessors/accessors.cpp file. I just added std::cout << "SYCL exception caught: " << e.get_cl_code() << '\n';. See the fully modified code:

    /***************************************************************************
     *
     *  Copyright (C) 2016 Codeplay Software Limited
     *  Licensed under the Apache License, Version 2.0 (the "License");
     *  you may not use this file except in compliance with the License.
     *  You may obtain a copy of the License at
     *
     *      http://www.apache.org/licenses/LICENSE-2.0
     *
     * ...
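For context, the usual shape of the error-handling code around a ComputeCpp command group looks roughly like the sketch below; the queue variable and the kernel body are placeholders rather than the exact accessors.cpp code:

    try {
        queue.submit([&](cl::sycl::handler &cgh) {
            // ... accessors and kernel launch go here ...
        });
        queue.wait_and_throw();   // surfaces asynchronous errors as exceptions
    } catch (const cl::sycl::exception &e) {
        std::cout << "SYCL exception caught: " << e.what() << '\n';
        std::cout << "OpenCL error code: " << e.get_cl_code() << '\n';
    }

e.what() typically carries the human-readable message (here the RT0101 kernel-creation failure), while get_cl_code(), which the question prints, returns the underlying OpenCL error code.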

cudamemcpyasync and streams behaviour understanding

Submitted by 这一生的挚爱 on 2019-12-02 05:50:21
I have the simple code shown below, which does nothing but copy some data from the host to the device using streams. After running nvprof, though, I am confused about whether cudaMemcpyAsync is really asynchronous, and about how the streams behave.

    #include <stdio.h>

    #define NUM_STREAMS 4

    cudaError_t memcpyUsingStreams(float *fDest, float *fSrc, int iBytes,
                                   cudaMemcpyKind eDirection, cudaStream_t *pCuStream)
    {
        int iIndex = 0;
        cudaError_t cuError = cudaSuccess;
        int iOffset = 0;
        iOffset = (iBytes / NUM_STREAMS);
        /* Creating streams if not present */
        if (NULL == pCuStream) {
            pCuStream = (cudaStream ...
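One detail that often explains "cudaMemcpyAsync is not actually asynchronous" observations: the host buffer must be page-locked (pinned) for the copy to be asynchronous with respect to the host; with ordinary pageable memory it degrades to a blocking copy. A minimal sketch under that assumption - the sizes and variable names are illustrative:

    #define NUM_STREAMS 4
    const int N = 1 << 20;                                     // floats per stream (illustrative)

    float *h_src, *d_dst;
    cudaStream_t streams[NUM_STREAMS];

    cudaMallocHost(&h_src, NUM_STREAMS * N * sizeof(float));   // pinned host memory
    cudaMalloc(&d_dst, NUM_STREAMS * N * sizeof(float));
    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NUM_STREAMS; ++i) {
        size_t off = (size_t)i * N;
        // Each chunk goes into its own stream, so copies can overlap with work in other streams.
        cudaMemcpyAsync(d_dst + off, h_src + off, N * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
    }
    cudaDeviceSynchronize();                                    // wait for all streams to finish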

Tobii Eye Tracker

Submitted by 可紊 on 2019-12-02 05:05:27
We are trying to connect our Tobii Eye Tracker to our Nvidia Jetson TX2 module running Ubuntu 16.04.6 LTS. However, when we try to pip install tobii_research we keep getting an error saying that no matching distribution was found for it. Has anyone had any success doing this? We are using a virtual environment for Python 3.5, and we are also trying to install psychopy, but it keeps failing with error code 1 in /tmp/pip-install-cdg_if0d/psychopy. Do we need psychopy in order to do the pip install of tobii_research? We also have a library that is called "tobiiresearch" and it has a ...

Practice computing grid size for CUDA

Submitted by 时间秒杀一切 on 2019-12-02 04:06:40
    dim3 block(4, 2);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed? Why is block.x added to nx and then 1 subtracted? There is a preview ( https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false ) but page 53 is missing.

This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") ...
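The (n + b - 1) / b expression is integer ceiling division: it gives the smallest number of blocks of size b that cover n elements, with at most one partially filled block. A short worked sketch; the numbers and the kernel are illustrative only:

    // With nx = 10 columns and block.x = 4:
    //   10 / 4           = 2 blocks -> only 8 columns covered (too few)
    //   (10 + 4 - 1) / 4 = 13 / 4 = 3 blocks -> 12 threads, all 10 columns covered
    // The surplus threads are masked off inside the kernel with a bounds check:
    __global__ void touch(float *out, int nx, int ny)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int iy = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (ix < nx && iy < ny)
            out[iy * nx + ix] = 0.0f;                     // only in-range threads write
    }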

CUDA: Thread ID assignment in 2D grid

Submitted by 时光怂恿深爱的人放手 on 2019-12-02 03:09:00
Let's suppose I have a kernel call with a 2D grid, like so:

    dim3 dimGrid(x, y); // not important what the actual values are
    dim3 dimBlock(blockSize, blockSize);
    myKernel<<<dimGrid, dimBlock>>>();

Now I've read that multidimensional grids are merely meant to ease programming - the underlying hardware will only ever use 1D linearly cached memory (unless you use texture memory, but that's not relevant here). My question is: in what order will the threads be assigned to the grid indices during warp scheduling? Will they be assigned horizontally ("iterate" x, then y) or vertically ("iterate" y, ...
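Within one block, the CUDA programming guide defines the linear thread ID with threadIdx.x varying fastest, then threadIdx.y, then threadIdx.z, and warps are formed from consecutive linear IDs; the order in which blocks are scheduled onto SMs, by contrast, is not specified. A small sketch that records this linearisation so it can be inspected on the host (names are illustrative):

    __global__ void record_warp_ids(int *warp_id_out)
    {
        // Linear ID inside the block: x varies fastest, then y.
        int linear_tid = threadIdx.y * blockDim.x + threadIdx.x;
        int warp_id = linear_tid / warpSize;                   // warpSize is 32 on current GPUs

        int threads_per_block = blockDim.x * blockDim.y;
        int block_slot = blockIdx.y * gridDim.x + blockIdx.x;  // one slot per block in the grid
        warp_id_out[block_slot * threads_per_block + linear_tid] = warp_id;
    }

Copying warp_id_out back shows that threads sharing a warp are those adjacent along x first, i.e. the "horizontal" ordering the question asks about.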

Difference on creating a CUDA context

Submitted by 烈酒焚心 on 2019-12-02 01:07:58
Question: I have a program that uses three kernels. In order to get the speedups, I was doing a dummy memory copy to create a context, as follows:

    __global__ void warmStart(int* f)
    {
        *f = 0;
    }

which is launched before the kernels I want to time, as follows:

    int *dFlag = NULL;
    cudaMalloc( (void**)&dFlag, sizeof(int) );
    warmStart<<<1, 1>>>(dFlag);
    Check_CUDA_Error("warmStart kernel");

I also read about other, simpler ways to create a context, such as cudaFree(0) or cudaDeviceSynchronize(). But using these API ...
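A common pattern is to force lazy context creation with a no-op runtime call and then time the real kernels with CUDA events, so the one-time initialization cost stays out of the measurement. A minimal sketch; myKernel, grid and block are placeholders, not the three kernels from the question:

    cudaFree(0);                              // forces context creation before any timing starts

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<grid, block>>>(/* args */);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds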

SLURM: After allocating all GPUs no more cpu job can be submitted

Submitted by ≯℡__Kan透↙ on 2019-12-02 00:08:41
We have just started using Slurm for managing our GPUs (currently just 2). We use Ubuntu 14.04 and slurm-llnl. I have configured gres.conf and srun works. The problem is that if I run two jobs with --gres=gpu:1, the two GPUs are successfully allocated and the jobs start running; now I expect to be able to run more jobs (in addition to the 2 GPU jobs) without --gres=gpu:1 (i.e. jobs that only use CPU and RAM), but it is not possible. The error message says that it could not allocate the required resources (even though there are 24 CPU cores). This is my gres.conf:

    Name=gpu Type=titanx File=/dev ...
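A frequent cause of this behaviour is Slurm's default selection plugin, select/linear, which allocates whole nodes, so the two GPU jobs already own the node and CPU-only jobs cannot fit alongside them. A hedged sketch of the slurm.conf settings that make CPUs and memory individually consumable; the node name, core count, memory and partition values below are illustrative and must be adapted to the actual cluster:

    # slurm.conf (excerpt, illustrative values)
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory   # track cores and memory per job, not whole nodes
    GresTypes=gpu
    NodeName=mynode CPUs=24 Gres=gpu:titanx:2 RealMemory=64000 State=UNKNOWN
    PartitionName=main Nodes=mynode Default=YES Shared=YES MaxTime=INFINITE State=UP

Changing SelectType requires restarting slurmctld and the slurmd daemons before jobs can share a node this way.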