loop-unrolling

template arguments inside a compile time unrolled for loop?

*爱你&永不变心* submitted on 2019-12-21 05:06:44
Question: Wikipedia (here) gives a compile-time unrolling of a for loop. I was wondering whether we can use a similar for loop with template statements inside. For example, is the following loop valid?

    template<int max_subdomain>
    void Device<max_subdomain>::createSubDomains()
    {
        for (int i = 0; i < max_subdomain; ++i) {
            SubDomain<i> tmp(member);
            ...
            // some operations on tmp
            ...
        }
    }

SubDomain is a class which takes an int template parameter and here has been constructed with an argument that is a member
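A runtime loop variable such as `i` is not a constant expression, so `SubDomain<i>` cannot compile inside an ordinary for loop. One common C++17 workaround, sketched below, expands the indices as a parameter pack through `std::integer_sequence`. This is only an illustration under stated assumptions: the `SubDomain` body, its `weight()` member, and the `visit` helper are hypothetical stand-ins, because the question does not show the real class.

```cpp
#include <utility>

// Hypothetical stand-in for the question's SubDomain<i>; the real class is not shown.
template <int I>
struct SubDomain {
    int member;
    constexpr explicit SubDomain(int m) : member(m) {}
    // Assumed operation, used only so the example has an observable result.
    constexpr int weight() const { return I * member; }
};

template <int MaxSubdomain>
struct Device {
    int member;

    // Visits SubDomain<0> .. SubDomain<MaxSubdomain - 1> and sums their weights.
    constexpr int createSubDomains() const {
        return visit(std::make_integer_sequence<int, MaxSubdomain>{});
    }

private:
    // Hypothetical helper: each Is is a compile-time constant, so
    // SubDomain<Is> is a valid type. The fold expression expands to one
    // term per index, which plays the role of the unrolled loop body.
    template <int... Is>
    constexpr int visit(std::integer_sequence<int, Is...>) const {
        return (0 + ... + SubDomain<Is>(member).weight());
    }
};
```

With these assumed definitions, `Device<4>{3}.createSubDomains()` instantiates `SubDomain<0>` through `SubDomain<3>` and evaluates to (0 + 1 + 2 + 3) * 3.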

Java JIT loop unrolling policy?

痞子三分冷 submitted on 2019-12-19 16:54:19
Question: What is the loop unrolling policy for the JIT? Or, if there is no simple answer to that, is there some way I can check where/when loop unrolling is being performed in a loop?

    GNode child = null;
    for (int i = 0; i < 8; i++) {
        child = octree.getNeighbor(nn, i, MethodFlag.NONE);
        if (child == null) break;
        RecurseForce(leaf, child, dsq, epssq);
    }

Basically, I have a piece of code above that has a static number of iterations (eight), and it performs poorly when I leave the for loop as it is. But when I manually

Self-unrolling macro loop in C/C++

巧了我就是萌 submitted on 2019-12-17 15:57:18
Question: I am currently working on a project where every cycle counts. While profiling my application I discovered that the overhead of some inner loops is quite high, because they consist of just a few machine instructions. Additionally, the number of iterations in these loops is known at compile time. So I thought that, instead of manually unrolling the loop with copy & paste, I could use macros to unroll the loop at compile time so that it can be easily modified later. What I imagine is something like this:
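The excerpt stops before the asker's macro, so it is not reproduced here. As a macro-free alternative for the same goal in C++17, a recursive function template can repeat a loop body a fixed number of times. The sketch below is an assumption-laden illustration: `unroll` and `dot4` are hypothetical names, and whether the calls end up fully inlined still depends on the compiler and optimization level.

```cpp
// Hypothetical helper: calls body(0), body(1), ..., body(N - 1).
// The recursion is resolved at compile time, so no loop counter or
// branch remains once the optimizer inlines the calls.
template <int N, typename F>
constexpr void unroll(F&& body) {
    if constexpr (N > 0) {
        unroll<N - 1>(body);  // emit iterations 0 .. N - 2 first
        body(N - 1);          // then the last iteration
    }
}

// Example inner loop with a trip count known at compile time.
constexpr int dot4(const int* a, const int* b) {
    int sum = 0;
    unroll<4>([&](int i) { sum += a[i] * b[i]; });
    return sum;
}
```

Changing the trip count only means changing the template argument, which addresses the "easily modified later" requirement without copy and paste.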

Determining the optimal value for #pragma unroll N in CUDA

我怕爱的太早我们不能终老 submitted on 2019-12-12 14:27:17
Question: I understand how #pragma unroll works, but if I have the following example:

    __global__ void test_kernel(const float* B, const float* C, float* A_out)
    {
        int j = threadIdx.x + blockIdx.x * blockDim.x;
        if (j < array_size) {
            #pragma unroll
            for (int i = 0; i < LIMIT; i++) {
                A_out[i] = B[i] + C[i];
            }
        }
    }

I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x threads and y blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1

Porting Duff's device from C to JavaScript

廉价感情. submitted on 2019-12-12 13:34:26
Question: I have this kind of Duff's device in C and it works fine (it formats text as money):

    #include <stdio.h>
    #include <string.h>

    char *money(const char *src, char *dst)
    {
        const char *p = src;
        char *q = dst;
        size_t len;

        len = strlen(src);
        switch (len % 3) {
            do {
                *q++ = ',';
                case 0: *q++ = *p++;
                case 2: *q++ = *p++;
                case 1: *q++ = *p++;
            } while (*p);
        }
        *q++ = 0;
        return dst;
    }

    int main(void)
    {
        char str[] = "1234567890123";
        char res[32];

        printf("%s\n", money(str, res));
        return 0;
    }

Output: 1,234,567,890

Should I look into PTX to optimize my kernel? If so, how?

筅森魡賤 submitted on 2019-12-12 10:37:03
Question: Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further? One example: I read that one can find out from the PTX code whether the automatic loop unrolling worked. If this is not the case, one would have to unroll the loops manually in the kernel code. Are there other use cases for the PTX code? Do you look into your PTX code? Where can I find out how to read the PTX code CUDA generates for my kernels?

Answer 1: The first point to make about PTX is that it

How to tell the compiler to unroll this loop [duplicate]

落爺英雄遲暮 submitted on 2019-12-08 19:17:33
This question already has answers here: Tell gcc to specifically unroll a loop (3 answers). Closed 6 years ago.

I have the following loop that I am running on an ARM processor.

    // pin here is pointer to some part of an array
    for (i = 0; i < v->numelements; i++)
    {
        pe = pptr[i];
        peParent = pe->parent;
        SPHERE *ps = (SPHERE *)(pe->data);
        pin[0] = FLOAT2FIX(ps->rad2);
        pin[1] = *peParent->procs->pe_intersect == &SphPeIntersect;
        fixifyVector( &pin[2], ps->center ); // Is an inline function
        pin = pin + 5;
    }

From the slow performance of the loop, I can judge that the compiler was unable to unroll this

Force/Convince/Trick GCC into Unrolling _Longer_ Loops?

纵饮孤独 submitted on 2019-12-08 17:28:25
Question: How do I convince GCC to unroll a loop where the number of iterations is known, but large? I'm compiling with -O3. The real code in question is more complex, of course, but here's a boiled-down example that has the same behavior:

    int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };

    int get_sum_1()
    {
        int total = 0;
        for (int i = 0; i < CONSTANT_COUNT; ++i) {
            total += constants[i];
        }
        return total;
    }

...if CONSTANT_COUNT is defined as 8 (or less) then GCC will unroll the
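One option that may apply here, depending on the GCC version: GCC 8 and later document a loop-specific `#pragma GCC unroll n`, placed directly before the loop, which requests that the loop be unrolled n times. The sketch below applies it to the example from the question. The value 12 for `CONSTANT_COUNT` is an assumption (the excerpt does not show its definition), and other compilers may ignore the pragma with a warning.

```cpp
// Assumed value: the excerpt does not show how CONSTANT_COUNT is defined.
constexpr int CONSTANT_COUNT = 12;

int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };

int get_sum_1()
{
    int total = 0;
    // Ask GCC (8 or later) to unroll the following loop 12 times.
    // The program's result is the same whether or not the request is honored.
#pragma GCC unroll 12
    for (int i = 0; i < CONSTANT_COUNT; ++i) {
        total += constants[i];
    }
    return total;
}
```

Whether the loop was actually unrolled is best confirmed by inspecting the generated assembly, for example with `g++ -O3 -S`.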
