gpgpu

Processing Shared Work Queue Using CUDA Atomic Operations and Grid Synchronization

左心房为你撑大大i submitted on 2020-12-21 02:44:37
Question: I'm trying to write a kernel whose threads iteratively process items in a work queue. My understanding is that I should be able to do this by using atomic operations to manipulate the work queue (i.e., grab work items from the queue and insert new work items into the queue), and by using grid synchronization via cooperative groups to ensure all threads are at the same iteration (I ensure the number of thread blocks doesn't exceed the device capacity for the kernel). However, sometimes I observe
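A minimal sketch of the pattern described above, not the asker's actual code: a double-buffered queue in global memory, atomicAdd to reserve slots when inserting new work items, and a grid-wide barrier between iterations. All names, sizes, and the placeholder "work" are illustrative.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// d_queue holds the current iteration's items (*d_count of them);
// new items are appended to d_next_queue via atomicAdd on *d_next_count.
// The host must zero *d_next_count before launch and size both buffers
// for the worst case.
__global__ void process_queue(int *d_queue, int *d_next_queue,
                              int *d_count, int *d_next_count,
                              int num_iterations)
{
    cg::grid_group grid = cg::this_grid();

    for (int it = 0; it < num_iterations; ++it) {
        int n = *d_count;

        // Grid-stride loop: each thread grabs items from the current queue.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            int item = d_queue[i];
            int produced = item - 1;                     // placeholder "work"
            if (produced > 0) {
                int slot = atomicAdd(d_next_count, 1);   // reserve a slot
                d_next_queue[slot] = produced;           // insert new work item
            }
        }

        grid.sync();  // every thread has finished this iteration

        if (grid.thread_rank() == 0) {
            *d_count = *d_next_count;  // publish the new queue length
            *d_next_count = 0;
        }
        grid.sync();  // updated counters are now visible grid-wide

        // Each thread swaps its own copies of the pointers, so the buffer
        // filled above becomes the current queue for the next iteration.
        int *tmp = d_queue; d_queue = d_next_queue; d_next_queue = tmp;
    }
}

A kernel like this has to be launched with cudaLaunchCooperativeKernel, with no more blocks than can be co-resident on the device (e.g. as reported by cudaOccupancyMaxActiveBlocksPerMultiprocessor); otherwise grid.sync() is not valid.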

Can 1 CUDA core process more than 1 floating-point instruction per clock (Maxwell)?

橙三吉。 submitted on 2020-11-27 02:00:23
Question: In the List of Nvidia GPUs - GeForce 900 Series - it is stated that single-precision performance is calculated as 2 times the number of shaders multiplied by the base core clock speed. For example, for the GeForce GTX 970 we can calculate the peak performance: 1664 cores * 1050 MHz * 2 = 3 494 GFLOPS (3 494 400 MFLOPS). This is the value shown in the column "Processing Power (peak) GFLOPS - Single Precision". But why must we multiply by 2? It is written at http://devblogs.nvidia.com/parallelforall
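The factor of 2 comes from the fused multiply-add (FMA): a CUDA core can issue one FMA instruction per clock, and an FMA is conventionally counted as two floating-point operations (one multiply plus one add). For the GTX 970 figure quoted above, 1664 cores x 1.05 GHz x 2 FLOPs per FMA ≈ 3494 GFLOPS. A tiny illustrative kernel (names are hypothetical, not from the question):

// c[i] = a[i] * b[i] + c[i] compiles to a single FFMA instruction,
// but it is counted as 2 FLOPs -- hence peak GFLOPS = cores * clock * 2.
__global__ void fma_demo(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fmaf(a[i], b[i], c[i]);
}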

Open-source project OEIP: game engines and audio/video multimedia (UE4/Unity3D)

拜拜、爱过 submitted on 2020-08-19 13:08:41
I am now open-sourcing a project, OEIP. Project example: a demo of the implemented functionality. The demo shows, in UE4, a camera being captured and output directly to a UE4 texture via OEIP, while a UE4 RenderTarget is used directly as an input source, processed through OEIP's GPU pipeline, and pushed out as a live stream. On the other side, Unity3D likewise takes a RenderTarget as input, processes it with OEIP, and pushes a stream. Through a live-streaming SDK notification built on signalR and wrapped by OEIP, each side pulls the other's stream, and the corresponding OEIP pipeline outputs it directly to a Texture2D for display. The demo machine is an i5-7500 with 8 GB of RAM, handling two 1080p push streams and two 1080p pull streams, plus screen-capture video generation and yolov3-tiny neural-network recognition, so the CPU struggles to keep up.

This is a demo-level setup I built to validate some techniques on my own. It only hooks up basic webcam processing, does not provide an implementation for a stable live-streaming provider, includes only some basic image processing, and push/pull streaming only supports the 422P/420P formats. Still, I have spent a great deal of spare time on it and worked on it with real enthusiasm; spare time is limited, testing is incomplete, and my C++ is not very strong, so there are certainly many hidden problems. Feel free to point out issues, and contributions of fixes are even more welcome.

The focus of this project is image processing and its integration with game engines: integrating with game engines at lower performance cost and making it easy to bring in all kinds of image processing, including neural-network-based image processing. The remaining parts combine code found online with logic refined through testing.

Understanding device allocation, parallelism (tf.while_loop) and tf.function in TensorFlow

最后都变了- submitted on 2020-05-16 06:45:44
Question: I'm trying to understand parallelism on the GPU in TensorFlow, as I need to apply it to uglier graphs.

import tensorflow as tf
from datetime import datetime

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([100000], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=1000)  # tweak

@tf.function
def b(i):
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(i, [-1, 1]), tf.constant([0], dtype=tf.dtypes.float32)))

Data sharing between CPU and GPU on modern x86 hardware with OpenCL or other GPGPU framework

▼魔方 西西 submitted on 2020-05-11 07:50:10
Question: The progressing unification of CPU and GPU hardware, as evidenced by AMD Kaveri with hUMA (heterogeneous Uniform Memory Access) and Intel 4th-generation CPUs, should allow copy-free sharing of data between CPU and GPU. I would like to know whether the most recent OpenCL (or other GPGPU framework) implementations allow true copy-free sharing (no explicit or implicit data copying) of large data structures between code running on the CPU and GPU. Answer 1: The ability to share data between host and device without
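For comparison, a minimal sketch of the CUDA counterpart of this kind of copy-free sharing (mapped pinned memory); this is not the OpenCL answer quoted above, and whether access is truly copy-free depends on the hardware: on integrated or hUMA-style systems the CPU and GPU touch the same physical memory, while on a discrete GPU the accesses cross PCIe.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping host memory into the GPU

    // Pinned, mapped allocation: the GPU can read and write this buffer
    // directly, with no cudaMemcpy in either direction.
    float *host_ptr = nullptr;
    cudaHostAlloc(&host_ptr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) host_ptr[i] = 1.0f;

    float *dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);  // device view of the same memory

    scale<<<(n + 255) / 256, 256>>>(dev_ptr, n, 2.0f);
    cudaDeviceSynchronize();

    printf("host_ptr[0] = %f\n", host_ptr[0]);  // 2.0 -- the CPU sees the GPU's writes
    cudaFreeHost(host_ptr);
    return 0;
}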

Every second counts in COVID-19 drug development: how can Alibaba's high-performance computing help?

早过忘川 submitted on 2020-03-25 07:20:21
Editor's note: After the COVID-19 outbreak, to help fight the epidemic, Alibaba Cloud offered high-performance computing, SCC supercomputing clusters, CPU/GPU machines, cloud supercomputing, AI and other technologies free of charge to public scientific research institutions worldwide. Recently, a number of research institutes and universities have been running drug-development-related numerical computations on Alibaba Cloud's E-HPC cloud supercomputing service, with technical support and follow-up from the Alibaba Cloud supercomputing team. This article focuses on the drug-screening stage and how E-HPC cloud supercomputing helps researchers process large small-molecule libraries quickly and in parallel. It also introduces the Alibaba Cloud solution for the Global Health Drug Discovery Institute (GHDDI) computing-power and results-sharing open platform.

Viruses, drug development, and high-performance computing

The life cycle of a drug is extremely long: from initial research to market launch takes at least 10 years. Against the backdrop of an epidemic where every second counts, time is especially precious, so many scientists try to find a treatment for the new coronavirus among existing drugs, skipping many of the later approval and launch steps. In the compound-discovery stage, the traditional approach is to screen candidates through large numbers of experiments. Today, scientists instead use machines to simulate the interaction between molecular compounds and targets, and only the compounds that look effective go on to experiments. In this process, high-performance computing (HPC), often called "supercomputing", is indispensable to modern drug development. The rise of cloud computing has also changed how scientists obtain computing power and consume supercomputing services: for example, Alibaba Cloud's E-HPC cloud supercomputing product lets scientists build high-performance cluster systems on the cloud by themselves, meeting drug developers' demands for a computing platform.