opencl

Processor Affinity in OpenCL

十年热恋 submitted on 2019-11-29 11:30:13
Can we impose processor affinity in OpenCL? For example, thread #1 executes on processor #5, thread #2 executes on processor #6, thread #3 executes on processor #7, and so on? Thanks

You can't specify affinity at that low a level with OpenCL, as far as I know. But starting with OpenCL 1.2 you have some control over affinity by partitioning the device into sub-devices using clCreateSubDevices (possibly with one processor in each sub-device by using CL_DEVICE_PARTITION_BY_COUNTS with counts of 1) and running separate kernel executions on each sub-device. This would very likely run poorly on anything other than a CPU-based OpenCL implementation.
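As a rough illustration of the sub-device approach described above, here is a hedged C sketch (error handling omitted; it assumes an OpenCL 1.2 CPU device has already been selected as `device`, and a four-core split is an arbitrary illustrative choice):

```c
#include <CL/cl.h>

/* Sketch: partition a CPU device into single-compute-unit sub-devices
 * using CL_DEVICE_PARTITION_BY_COUNTS (OpenCL 1.2+). */
void partition_per_core(cl_device_id device)
{
    /* One compute unit per sub-device; the counts list is terminated by
     * CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, the property list by 0. */
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS,
        1, 1, 1, 1,   /* four sub-devices, 1 compute unit each (illustrative) */
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
        0
    };
    cl_device_id subdevices[4];
    cl_uint num_created = 0;
    clCreateSubDevices(device, props, 4, subdevices, &num_created);
    /* Each subdevices[i] can now get its own context and queue; kernels
     * enqueued there run only on that compute unit. */
}
```

Note that which compute unit maps to which physical core is still up to the runtime, so this approximates affinity rather than guaranteeing it.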

Access Path in Zero-Copy in OpenCL

断了今生、忘了曾经 submitted on 2019-11-29 10:32:31
Question: I am a little bit confused about how exactly zero-copy works.

1 - I want to confirm that the following corresponds to zero-copy in OpenCL.

[ASCII diagram, garbled in extraction: a memory object resides in GPU RAM (copy c1); the CPU accesses it directly over PCI-E, so the intermediate copy into system RAM (c2) and the further copy to the CPU (c3) are avoided, each marked with an X.]
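For reference, the usual way to request zero-copy behaviour in OpenCL is to allocate the buffer with CL_MEM_ALLOC_HOST_PTR and access it through map/unmap rather than explicit reads and writes. A minimal C sketch, assuming `context` and `queue` already exist and with error handling omitted:

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: zero-copy access via mapping. With CL_MEM_ALLOC_HOST_PTR the
 * runtime may place the buffer where both CPU and GPU can reach it, so
 * clEnqueueMapBuffer can hand back a pointer without a bulk copy. */
void *map_zero_copy(cl_context context, cl_command_queue queue,
                    size_t size, cl_mem *buf_out)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);
    /* Blocking map: the CPU gets a directly usable pointer. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    *buf_out = buf;
    return ptr;   /* caller writes here, then calls clEnqueueUnmapMemObject() */
}
```

Whether the map is actually copy-free depends on the vendor's runtime; the flag is a hint, not a guarantee.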

Convenient way to show OpenCL error codes?

巧了我就是萌 submitted on 2019-11-29 09:22:32
As per the title, is there a convenient way to show readable OpenCL error codes? Being able to convert codes like '-1000' to a name would save a lot of time browsing through error code tables.

This is what I currently do. I believe the error list to be complete for OpenCL 1.2.

    cl_int result = clSomeFunction();
    if (result != CL_SUCCESS)
        std::cerr << getErrorString(result) << std::endl;

And getErrorString is defined as follows (abridged here; the full version covers every OpenCL 1.2 code):

    const char *getErrorString(cl_int error)
    {
        switch (error) {
        // run-time and JIT compiler errors
        case 0:  return "CL_SUCCESS";
        case -1: return "CL_DEVICE_NOT_FOUND";
        case -2: return "CL_DEVICE_NOT_AVAILABLE";
        // ... remaining run-time, compile-time and extension error codes ...
        default: return "Unknown OpenCL error";
        }
    }

C# Rendering OpenCL-generated image

无人久伴 submitted on 2019-11-29 08:57:48
Problem: I'm trying to render a dynamic Julia fractal in real time. Because the fractal is constantly changing, I need to be able to render at least 20 frames per second, preferably more. What you need to know about a Julia fractal is that every pixel can be calculated independently, so the task is easily parallelizable.

First approach: Because I'm already used to MonoGame in C#, I tried writing a shader in HLSL that would do the job, but the compiler kept complaining because I used up more than the allowable 64 arithmetic slots (I need at least a thousand).

Second approach: Using the CPU, it
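The per-pixel independence mentioned above comes from the escape-time iteration itself. A minimal C sketch of the work each pixel would do (the function name and the iteration cap are illustrative choices, not from the question):

```c
/* Escape-time iteration for one pixel of a Julia set: iterate
 * z <- z^2 + c until |z| > 2 or max_iter is reached. Each pixel's
 * count depends only on its own starting z, so all pixels can be
 * computed in parallel without any shared state. */
int julia_iterations(double zx, double zy, double cx, double cy, int max_iter)
{
    int i;
    for (i = 0; i < max_iter; ++i) {
        if (zx * zx + zy * zy > 4.0)   /* |z| > 2: the orbit escapes */
            break;
        double tmp = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = tmp;
    }
    return i;   /* the iteration count is mapped to a pixel colour */
}
```

For example, z0 = (0, 0) with c = (0, 0) never escapes and returns max_iter, while z0 = (2, 2) escapes immediately and returns 0; this is exactly the hot loop that exceeds the HLSL arithmetic-slot limit when the cap is in the thousands.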

Does AMD's OpenCL offer something similar to CUDA's GPUDirect?

 ̄綄美尐妖づ submitted on 2019-11-29 08:17:31
Question: NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically:

1) Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine transferred across a network interface: with GPUDirect, GPU memory goes to host memory then straight to the network

What's the advantage of the local memory in OpenCL?

主宰稳场 submitted on 2019-11-29 07:34:09
I'm wondering what the advantage of local memory is, since each work-item can read from global memory separately and freely. Can't we just use global memory?

For example, say we have a 1000*1000 image and we want to add 1 to every pixel value. We could just use 1000*1000 global memory accesses, right? Would it be faster if we used local memory and split the 1000*1000 image into 100 parts of 100*100 each? I'd be very grateful if you could show me a simple example using local memory.

Can't we just use global memory? Of course you can. First write actual working code. Then optimize. Since the global memory can get
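To illustrate the distinction drawn above, here is a hedged sketch of two OpenCL C kernels. The simple per-pixel "add 1" gains nothing from local memory because no data is shared between work-items; local memory starts to pay off when work-items in a group reuse each other's loads, as in the 1-D three-point average below (an illustrative example, not from the question; boundary handling is simplified):

```c
/* Case 1: no sharing, so local memory cannot help. */
__kernel void add_one(__global uchar *img)
{
    size_t i = get_global_id(0);
    img[i] += 1;
}

/* Case 2: neighbouring work-items reuse each other's data. The tile is
 * read from global memory once per work-group instead of roughly three
 * times per pixel. Assumes 0 < g < N-1 for brevity. */
__kernel void avg3(__global const float *in, __global float *out,
                   __local float *tile)
{
    size_t g = get_global_id(0);
    size_t l = get_local_id(0);
    tile[l + 1] = in[g];                              /* interior load   */
    if (l == 0)                     tile[0] = in[g - 1];      /* left halo  */
    if (l == get_local_size(0) - 1) tile[l + 2] = in[g + 1];  /* right halo */
    barrier(CLK_LOCAL_MEM_FENCE);                     /* tile complete   */
    out[g] = (tile[l] + tile[l + 1] + tile[l + 2]) / 3.0f;
}
```

The general rule of thumb: local memory helps when the same global data would otherwise be fetched multiple times by different work-items in a group; plain streaming access patterns do not benefit.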

Tronlong datasheet: TI AM5728 (floating-point dual-DSP C66x + dual ARM Cortex-A15) industrial-control and high-performance audio/video processor

我怕爱的太早我们不能终老 submitted on 2019-11-29 07:26:07
The TL5728-IDK is a development board designed by Guangzhou Tronlong around the SOM-TL5728 system-on-module. The carrier board uses a lead-free immersion-gold 4-layer PCB design and provides a test platform for the SOM-TL5728, allowing quick evaluation of the module's overall performance. Tronlong provides not only a rich set of AM5728 getting-started tutorials and demo programs, but also DSP+ARM multi-core communication development tutorials and comprehensive technical support, assisting users with carrier-board design and debugging as well as DSP+ARM software development.

Development board overview:
- Based on the TI AM5728 floating-point dual-DSP C66x + dual ARM Cortex-A15 industrial-control and high-performance audio/video processor;
- Heterogeneous multi-core CPU integrating dual-core Cortex-A15, dual-core C66x floating-point DSP, dual-core PRU-ICSS, two dual-core Cortex-M4 IPUs, a dual-core GPU and other processing units, with support for OpenCL, OpenMP and IPC multi-core development;
- Strong video codec capability: hardware encode/decode of 1x 1080p60, 2x 720p60 or 4x 720p30 video, plus H.265 software decoding;
- 1x 1080p60 HDMI 1.4a output or 1x LCD output;
- A V-PORT video input connector on the board allows flexible connection of video input modules;
- Dual-core PRU-ICSS industrial real-time control subsystem supporting EtherCAT, EtherNet/IP, PROFIBUS

Tronlong datasheet: TI Sitara AM5728 (floating-point dual-DSP C66x + dual ARM Cortex-A15) + Xilinx Artix-7 FPGA development board

限于喜欢 submitted on 2019-11-29 07:25:50
The TL5728F-EVM, designed by Guangzhou Tronlong around the TI Sitara AM5728 (floating-point dual-DSP C66x + dual ARM Cortex-A15) plus a Xilinx Artix-7 FPGA, is a DSP+ARM+FPGA development platform suited to fields such as power-grid data acquisition, motor controllers, radar signal acquisition and analysis, medical instruments, and machine vision. The TL5728F-EVM carrier board uses a lead-free immersion-gold 6-layer PCB design; on the module, the AM5728 communicates with the FPGA over the GPMC bus, forming the DSP+ARM+FPGA architecture. The ARM side is mainly used for control, display, and simple algorithm processing; the DSP side for complex algorithm computation; and the FPGA side for acquisition, buffering, algorithm processing, high-speed AD/DA control, I/O expansion, and so on. The TL5728F-EVM offers a rich set of interfaces. Guangzhou Tronlong provides customers not only with abundant demo programs and DSP+ARM+FPGA multi-core communication development tutorials, but also long-term, comprehensive technical support, assisting with carrier-board design and debugging as well as DSP+ARM+FPGA software development, helping customers complete secondary development as quickly as possible and bring products to market rapidly.

Development board overview: based on the TI Sitara AM5728 (floating-point dual-DSP C66x + dual ARM Cortex-A15) + Xilinx Artix-7 FPGA industrial-control and high-performance audio/video processor

Tronlong TI AM5728 (floating-point dual-DSP C66x + dual ARM Cortex-A15) development board datasheet, for audio/video processing and power control

一笑奈何 submitted on 2019-11-29 06:42:30
The TL5728-EasyEVM is a development board designed by Guangzhou Tronlong around the SOM-TL5728 system-on-module, which is based on the TI AM5728 (floating-point dual-DSP C66x + dual ARM Cortex-A15). It provides a test platform for the SOM-TL5728, allowing quick evaluation of the module's overall performance. The TL5728-EasyEVM carrier board uses a lead-free immersion-gold 4-layer PCB design. Tronlong not only provides a rich set of AM5728 getting-started tutorials and assists customers with carrier-board development, but also offers long-term, comprehensive technical support to help customers complete secondary development as quickly as possible and bring products to market rapidly. Abundant demo programs and DSP+ARM multi-core communication development tutorials are also provided, with full technical support for carrier-board design and debugging as well as DSP+ARM software development.

Development board features:
- Based on the TI AM5728 floating-point dual-DSP C66x + dual ARM Cortex-A15 industrial-control and high-performance audio/video processor;
- Heterogeneous multi-core CPU integrating dual-core Cortex-A15, dual-core C66x floating-point DSP, dual-core PRU-ICSS, dual-core Cortex-M4 IPU, dual-core GPU and other processing units, with support for OpenCL, OpenMP and IPC multi-core development;
- Strong video codec capability: hardware encode/decode of 1x 1080p60, 2x 720p60 or 4x 720p30 video, plus H.265 software decoding;

Are OpenCL work items executed in parallel?

心已入冬 submitted on 2019-11-29 06:10:53
I know that work-items are grouped into work-groups, and that you cannot synchronize outside of a work-group. Does that mean work-items are executed in parallel? If so, is it possible/efficient to make one work-group with 128 work-items?

The work-items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency. On my AMD card, the 'compute units' are divided into 16 4-wide SIMD units. This means that 16 work items
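If you want to know the hardware's actual scheduling width rather than guess at it, OpenCL exposes it per kernel and device. A minimal C sketch (error handling omitted; assumes a built `kernel` and its `device` are in hand):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: query the batch size in which the device issues work-items
 * (e.g. 64 on many AMD GPUs, 32 for NVIDIA warps). Work-group sizes
 * that are a multiple of this value avoid partially filled batches. */
void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);
}
```

A work-group of 128 is typically fine as long as it is a multiple of this value and does not exceed CL_KERNEL_WORK_GROUP_SIZE for the kernel.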