memory

CUDA ---- Memory Access

与世无争的帅哥 submitted on 2020-01-11 21:02:42
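Memory Access Patterns: Most device code starts by fetching data from global memory, and most GPU applications are limited by memory bandwidth. Making the best use of global memory bandwidth is therefore the first step toward high performance: if global memory access is poorly tuned, other optimizations will have little effect. Much of what follows concerns the L1 cache. Treat that part as background, since Maxwell no longer caches global loads in L1 by default, but the caching concepts themselves still apply. Aligned and Coalesced Access: As the figure below shows, global memory loads and stores go through the caches. All data initially resides in DRAM, the physical device memory, whereas the global memory a kernel sees is a logical memory space. Kernel memory requests are serviced between device DRAM and the SM's on-chip memory in 128-byte or 32-byte transactions. Every global memory access goes through the L2 cache, and many also go through the L1 cache, depending on the GPU architecture and the access type. If both L1 and L2 are used, memory is accessed in 128-byte transactions; if only L2 is used, it is accessed in 32-byte transactions. On GPUs that allow L1 to be used (Maxwell merges L1 with the texture cache, so accesses that used to go through L1 now go through the texture cache)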
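The effect of alignment and coalescing on transaction count can be sketched on the host side. The snippet below is only an illustrative model, not a CUDA API: `segments_touched` is a hypothetical helper that counts how many 128-byte or 32-byte segments the 32 threads of a warp would touch, assuming thread `lane` reads `elem_bytes` bytes at `base + lane * stride_elems * elem_bytes`. Real hardware behavior depends on the architecture and cache configuration.

```cpp
#include <cstdint>

// Hypothetical helper (not part of CUDA): counts the distinct memory segments
// touched when each of the 32 threads in a warp reads `elem_bytes` bytes at
// base + lane * stride_elems * elem_bytes. Addresses never decrease with lane,
// so segments can be counted in a single ascending pass.
constexpr int segments_touched(std::uint64_t base, std::uint64_t stride_elems,
                               std::uint64_t elem_bytes, std::uint64_t segment_bytes) {
    int count = 0;
    std::uint64_t next_uncounted = 0;  // lowest segment index not yet counted
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        const std::uint64_t first = base + lane * stride_elems * elem_bytes;
        const std::uint64_t last = first + elem_bytes - 1;
        for (std::uint64_t seg = first / segment_bytes; seg <= last / segment_bytes; ++seg) {
            if (seg >= next_uncounted) {
                ++count;
                next_uncounted = seg + 1;
            }
        }
    }
    return count;
}
```

Under this model, an aligned, contiguous warp read of 4-byte elements fits in one 128-byte segment (`segments_touched(0, 1, 4, 128)`), shifting the base by 4 bytes makes it straddle two, and a stride of 32 elements spreads the same 32 loads over 32 separate segments. With 32-byte segments the misaligned case wastes proportionally less, because each extra segment fetched is smaller.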

[Parallel Computing - CUDA Development] CUDA ---- Warp Analysis

≯℡__Kan透↙ submitted on 2020-01-11 20:59:56
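Warp: Logically, all threads run in parallel. From the hardware's point of view, however, not all threads can actually execute at the same instant. The following explains some essentials of warps. Warps and Thread Blocks: The warp is the basic unit of execution on an SM. A warp consists of 32 parallel threads that execute in SIMT mode: all threads execute the same instruction, each on its own data. A block can be one-, two-, or three-dimensional, but from the hardware's perspective all threads are arranged in one dimension, and each thread has a unique ID (see the earlier post for how the ID is computed). The number of warps per block is given by the following formula: WarpsPerBlock = ceil(ThreadsPerBlock / warpSize). The threads of a warp always belong to the same block. If the number of threads in a block is not a multiple of the warp size, the last warp is left with some inactive threads: the hardware still fills out a full warp, and the inactive threads still consume SM resources. Warp Divergence: Control-flow statements are found in virtually every programming language, and GPUs support the traditional C-style explicit control-flow constructs such as if…else, for, and while.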
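The warp count and the padding it implies are simple integer arithmetic. The following host-side sketch assumes a warp size of 32, as stated above; `kWarpSize`, `warps_per_block`, and `inactive_threads` are illustrative names, not CUDA identifiers.

```cpp
// Warp size assumed to be 32, as in the text. In device code the built-in
// variable warpSize holds this value.
constexpr int kWarpSize = 32;

// Number of warps the hardware allocates for a block: the thread count
// divided by the warp size, rounded up.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Threads in the last warp that exist only as padding. They do no work
// but still occupy SM resources.
constexpr int inactive_threads(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}
```

For example, a two-dimensional block of 40 x 2 = 80 threads needs 3 warps (96 thread slots), so 16 slots are inactive. Choosing block sizes that are multiples of 32 avoids this waste.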

CUDA ---- Warp Analysis

耗尽温柔 submitted on 2020-01-11 20:56:36
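Warp: Logically, all threads run in parallel. From the hardware's point of view, however, not all threads can actually execute at the same instant. The following explains some essentials of warps. Warps and Thread Blocks: The warp is the basic unit of execution on an SM. A warp consists of 32 parallel threads that execute in SIMT mode: all threads execute the same instruction, each on its own data. A block can be one-, two-, or three-dimensional, but from the hardware's perspective all threads are arranged in one dimension, and each thread has a unique ID (see the earlier post for how the ID is computed). The number of warps per block is given by the following formula: WarpsPerBlock = ceil(ThreadsPerBlock / warpSize). The threads of a warp always belong to the same block. If the number of threads in a block is not a multiple of the warp size, the last warp is left with some inactive threads: the hardware still fills out a full warp, and the inactive threads still consume SM resources. Warp Divergence: Control-flow statements are found in virtually every programming language, and GPUs support the traditional C-style explicit control-flow constructs such as if…else, for, and while.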

[Parallel Computing - CUDA Development] CUDA ---- Warp Analysis

前提是你 submitted on 2020-01-11 20:53:30
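Warp: Logically, all threads run in parallel. From the hardware's point of view, however, not all threads can actually execute at the same instant. The following explains some essentials of warps. Warps and Thread Blocks: The warp is the basic unit of execution on an SM. A warp consists of 32 parallel threads that execute in SIMT mode: all threads execute the same instruction, each on its own data. A block can be one-, two-, or three-dimensional, but from the hardware's perspective all threads are arranged in one dimension, and each thread has a unique ID (see the earlier post for how the ID is computed). The number of warps per block is given by the following formula: WarpsPerBlock = ceil(ThreadsPerBlock / warpSize). The threads of a warp always belong to the same block. If the number of threads in a block is not a multiple of the warp size, the last warp is left with some inactive threads: the hardware still fills out a full warp, and the inactive threads still consume SM resources. Warp Divergence: Control-flow statements are found in virtually every programming language, and GPUs support the traditional C-style explicit control-flow constructs such as if…else, for, and while.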

0xffff0 and the BIOS

戏子无情 submitted on 2020-01-11 16:13:44
Question: When a PC first boots up, it starts executing at physical address 0xffff0. This address contains a jmp instruction to the BIOS. Now for my question: I always assumed that physical addresses are mapped to RAM. If RAM initially contains garbage values, what exactly puts the jmp instruction at 0xffff0? Is the jmp instruction always the same, or does it differ between BIOSes? Does 0xffff0 map from RAM to BIOS then (meaning it's "hard mapped")? Answer 1: The top 64kB or so are mapped to BIOS ROM,
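As a small aside on where the number 0xffff0 comes from, the sketch below assumes 16-bit real-mode segment:offset addressing, which the excerpt itself does not spell out. `real_mode_physical` and `in_top_64k_of_first_mib` are illustrative helper names, and the segment:offset pairs mentioned afterwards are just two combinations that produce the address.

```cpp
#include <cstdint>

// Real-mode address translation (assumed here): physical = segment * 16 + offset.
constexpr std::uint32_t real_mode_physical(std::uint16_t segment, std::uint16_t offset) {
    return (static_cast<std::uint32_t>(segment) << 4) + offset;
}

// True if the address lies in the top 64 KB of the first megabyte,
// i.e. in the range 0xF0000..0xFFFFF.
constexpr bool in_top_64k_of_first_mib(std::uint32_t physical) {
    return physical >= 0xF0000u && physical <= 0xFFFFFu;
}
```

Both `real_mode_physical(0xFFFF, 0x0000)` and `real_mode_physical(0xF000, 0xFFF0)` evaluate to 0xFFFF0, which sits 16 bytes below the 1 MiB boundary and inside the top 64 KB region the answer refers to. Sixteen bytes is just enough room for a jump to the rest of the firmware.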

Exact state of committed memory in java

耗尽温柔 submitted on 2020-01-11 15:30:30
Question: I'm curious what exactly "committed" memory means when the value is queried from the MemoryUsage class. The class documentation explains it as "committed represents the amount of memory (in bytes) that is guaranteed to be available for use by the Java virtual machine." Does this mean that the memory is in use by the JVM process and NOT available to other processes until the Java process releases it, or does it mean that the Java process will succeed if it tries to allocate up to that

iOS Memory Warnings

血红的双手。 submitted on 2020-01-11 14:11:35
Question: I'm trying to populate a collection view with images downloaded from a Parse database, but I'm receiving memory warnings followed by occasional crashes. Does anyone know how other apps manage to present so many images without crashing? Can someone show me how to optimize what I already have? Here's all the relevant code: https://gist.github.com/sungjp/99ae82dca625f0d73730 var imageCache : NSCache = NSCache() override func didReceiveMemoryWarning() { super.didReceiveMemoryWarning() // Dispose

variables retaining values after app close

只谈情不闲聊 submitted on 2020-01-11 13:49:07
Question: My app retains all of its variable values when it closes, and this is affecting how it runs when reopened. Is there any way to reset them all when the app closes, or a way to clear the app from memory when it is closed, so to speak? For the moment I have just been setting all of the important variables to 0 in the last few lines of execution, but I know there must be a correct way of doing this. EDIT: OK, I thought that it would just be easier to reply here instead of individually to

How is Stack memory allocated when using 'push' or 'sub' x86 instructions?

廉价感情. submitted on 2020-01-11 11:09:49
Question: I have been browsing for a while and I am trying to understand how memory is allocated to the stack when doing, for example: push rax Or moving the stack pointer to allocate space for local variables of a subroutine: sub rsp, X ;Move stack pointer down by X bytes What I understand is that the stack segment is anonymous in the virtual memory space, i.e., not file backed. What I also understand is that the kernel will not actually map an anonymous virtual memory segment to physical memory until

How is Stack memory allocated when using 'push' or 'sub' x86 instructions?

半城伤御伤魂 submitted on 2020-01-11 11:08:10
Question: I have been browsing for a while and I am trying to understand how memory is allocated to the stack when doing, for example: push rax Or moving the stack pointer to allocate space for local variables of a subroutine: sub rsp, X ;Move stack pointer down by X bytes What I understand is that the stack segment is anonymous in the virtual memory space, i.e., not file backed. What I also understand is that the kernel will not actually map an anonymous virtual memory segment to physical memory until