CUDA Fortran 4D array

Submitted by 元气小坏坏 on 2020-01-16 14:08:33

Question


My code is being slowed down by accesses to a 4D array in global memory.

I am using the PGI 2010 compiler.

The 4D array is read-only on the device, and its size is known only at run time.

I wanted to place it in texture memory, but found that my PGI version does not support textures. Because the size is known only at run time, constant memory is not an option either.

Only one dimension is known at compile time, like this: MyFourD(100,x,y,z), where x, y and z are user input.

My first idea was to use pointers, but I am not familiar with Fortran pointers.

If you have experience with this kind of situation, I would appreciate your help, because this alone makes my code 5 times slower than expected.

The following is a sample of what I am trying to do:

integer :: i, j, k

i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1

    do k = 0, 100 
        regvalue1 = somevalue1
        regvalue2 = somevalue2 
        regvalue3 =  somevalue3 

        d_value(i,j,k)=d_value(i,j,k)
     &     +myFourdArray(10,i,j,k)*regvalue1      
     &     +myFourdArray(32,i,j,k)*regvalue2      
     &     +myFourdArray(45,i,j,k)*regvalue3                    
    end do

Best regards,


Answer 1:


I believe the answer from @Alexander Vogt is on the right track - I would think about re-ordering the array storage. But I would try it like this:

integer :: i, j, k

i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1

    do k = 0, 100 
        regvalue1 = somevalue1
        regvalue2 = somevalue2 
        regvalue3 =  somevalue3 

        d_value(i,j,k)=d_value(i,j,k)
     &     +myFourdArray(i,j,k,10)*regvalue1      
     &     +myFourdArray(i,j,k,32)*regvalue2      
     &     +myFourdArray(i,j,k,45)*regvalue3                    
    end do

Note that the only change is to myFourdArray, there is no need for a change in data ordering in the d_value array.
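The reordering itself can be done once on the host before the data is copied to the device. The following is a minimal sketch in standard Fortran, not the asker's actual code: the names old_layout and new_layout and the extents nx, ny, nz are hypothetical placeholders for the run-time sizes, and 1-based subscripts are assumed for simplicity (the kernel above uses 0-based i, j, k, so the real arrays would need matching lower bounds).

```fortran
program reorder_demo
  implicit none
  ! n1 is the extent known at compile time (100 in the question).
  integer, parameter :: n1 = 100
  ! nx, ny, nz are hypothetical stand-ins for the user-supplied x, y, z.
  integer :: nx, ny, nz, m
  real, allocatable :: old_layout(:,:,:,:), new_layout(:,:,:,:)

  nx = 4; ny = 3; nz = 2
  allocate(old_layout(n1, nx, ny, nz))   ! fixed dimension first (original)
  allocate(new_layout(nx, ny, nz, n1))   ! fixed dimension last (reordered)
  call random_number(old_layout)

  ! Copy one slab per value of the fixed index.
  do m = 1, n1
    new_layout(:, :, :, m) = old_layout(m, :, :, :)
  end do

  ! Spot check: the same logical element is found in both layouts.
  print *, new_layout(2, 3, 1, 45) == old_layout(45, 2, 3, 1)
end program reorder_demo
```

The reordered array would then be the one transferred to the device and indexed as myFourdArray(i,j,k,n) in the kernel. The one-time host copy costs extra memory and time, which only pays off if the kernel reads the array often enough.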

The crux of this change is that we are allowing adjacent threads to access adjacent elements in myFourdArray and so we are allowing for coalesced access. Your original formulation forced adjacent threads to access elements that were separated by the length of the first dimension, and so did not allow for useful coalescing.
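To make the difference in stride concrete, here is a small sketch in standard Fortran that computes the column-major element offset for both layouts as the thread index i advances. The extents nx, ny, nz are made-up values standing in for the run-time sizes, subscripts are 1-based for simplicity, and offset is a hypothetical helper written for this illustration.

```fortran
program stride_demo
  implicit none
  ! n1 is the fixed extent from the question; nx, ny, nz are assumed values.
  integer, parameter :: n1 = 100, nx = 64, ny = 8, nz = 4
  integer :: i

  do i = 1, 4
    print '(a,i0,a,i0,a,i0)', 'i=', i, &
      '  original offset=',  offset(10, i, 1, 1, n1, nx, ny), &
      '  reordered offset=', offset(i, 1, 1, 10, nx, ny, nz)
  end do

contains

  ! Column-major offset (in elements) of (s1,s2,s3,s4) in an array whose
  ! first three extents are e1, e2, e3; subscripts are 1-based.
  pure integer function offset(s1, s2, s3, s4, e1, e2, e3)
    integer, intent(in) :: s1, s2, s3, s4, e1, e2, e3
    offset = (s1-1) + e1*((s2-1) + e2*((s3-1) + e3*(s4-1)))
  end function offset

end program stride_demo
```

With the constant in the first position, consecutive values of i land 100 elements apart; with i in the first position they land 1 element apart, which is the pattern that adjacent threads need for coalescing.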

Whether in CUDA C or CUDA Fortran, threads are grouped in X first, then Y and then Z dimensions. So the rapidly varying thread subscript is X first. Therefore, in matrix access, we want this rapidly varying subscript to show up in the index that is also rapidly varying.

In Fortran this index is the first of a multiple-subscripted array.

In C, this index is the last of a multiple-subscripted array.

Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position. But it broke this convention for myFourdArray by putting a constant in the first array subscript position. Thus your accesses to myFourdArray are noticeably slower.

When the code contains a loop, we also don't want the loop variable (k in this case) in the first subscript position for Fortran (or the last for C), as Alexander Vogt's version does, because that also breaks coalescing. On each iteration of the loop, multiple threads execute in lockstep, and those threads should all access adjacent elements. This is achieved by placing the subscript driven by the X thread index (i here) first (for Fortran, or last for C).




Answer 2:


You could invert the indexing, i.e. let the first dimension change the fastest. Fortran is column-major!

do k = 0, 100 
    regvalue1 = somevalue1
    regvalue2 = somevalue2 
    regvalue3 =  somevalue3 

    d_value(k,i,j)=d_value(k,i,j) +         &
      myFourdArray(k,i,j,10)*regvalue1 +    &
      myFourdArray(k,i,j,32)*regvalue2 +    &
      myFourdArray(k,i,j,45)*regvalue3                   
end do

If the last (in the original case, the first) dimension is always fixed (and not too large), consider individual arrays instead.

In my experience, pointers do not change much in terms of speed-up when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compile option to enable this with the PGI compiler.

Ah, ok it is a simple directive:

!$acc do vector
do k=...
enddo
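For readers unfamiliar with the strip-mining mentioned above, the transformation splits one loop into an outer loop over fixed-size strips and an inner loop within each strip. A minimal hand-written sketch in standard Fortran follows; the strip length of 16 and the array a are arbitrary choices for illustration, not values from the question or a PGI recommendation.

```fortran
program strip_mine_demo
  implicit none
  ! n matches the 0..100 loop range in the question; strip is an assumed size.
  integer, parameter :: n = 101, strip = 16
  integer :: a(0:n-1), total_plain, total_strip
  integer :: k, ks

  do k = 0, n-1
    a(k) = k
  end do

  ! Plain loop over all elements.
  total_plain = 0
  do k = 0, n-1
    total_plain = total_plain + a(k)
  end do

  ! Strip-mined version: the outer loop steps by the strip length, the
  ! inner loop covers one strip; min() handles the shorter final strip.
  total_strip = 0
  do ks = 0, n-1, strip
    do k = ks, min(ks + strip - 1, n-1)
      total_strip = total_strip + a(k)
    end do
  end do

  ! Both loops visit the same elements, so the totals agree.
  print *, total_plain, total_strip
end program strip_mine_demo
```

Whether this helps depends on the hardware and on what the compiler already does on its own; it is a restructuring of the iteration order, not a change in the result.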


Source: https://stackoverflow.com/questions/18958634/cuda-fortran-4d-array
