Slowdown when using OpenMP and calling a subroutine in a loop

Question

Here is a simple Fortran code using OpenMP that computes a summation of arrays multiple times. My computer has 6 cores (12 threads) and 16 GB of memory.

There are two versions of this code. The first version consists of a single file, test.f90, with the summation implemented directly in it:

program main
  implicit none

  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt
  real*8,allocatable,dimension(:,:,:)::theta, e

  allocate(theta(2000,50,5))
  allocate(e(2000,50,5))

  call system_clock(count_rate=rate)
  call system_clock(count=begin)

  !$omp parallel do
  do cnt = 1, 8
     do i = 1, 1001
        do j = 1, 50
           theta = theta+0.5d0*e
        end do
     end do       
  end do
  !$omp end parallel do

  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

  deallocate(theta)
  deallocate(e)

end program main

This version works fine with OpenMP, and we see a speedup.

In the second version, the summation is moved into a subroutine. There are two files, test.f90 and sub.f90:

! test.f90
program main
  use sub
  implicit none

  integer*8 :: begin, end, rate
  integer i, j, k, ii, jj, kk, cnt

  call system_clock(count_rate=rate)
  call system_clock(count=begin)

  !$omp parallel do
  do cnt = 1, 8
    call summation()
  end do
  !$omp end parallel do

  call system_clock(count=end)
  write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

end program main

and

! sub.f90
module sub
  implicit none

contains

  subroutine summation()
    implicit none
    real*8,allocatable,dimension(:,:,:)::theta, e
    integer i, j

    allocate(theta(2000,50,5))
    allocate(e(2000,50,5))

    theta = 0.d0
    e = 0.d0

    do i = 1, 101
      do j = 1, 50
        theta = theta+0.5d0*e
      end do
    end do

    deallocate(theta)
    deallocate(e)

  end subroutine summation

end module sub

I also wrote a Makefile:

FC = ifort -O2 -mcmodel=large -qopenmp
LN = ifort -O2 -mcmodel=large -qopenmp

FFLAGS = -c
LFLAGS =

result: sub.o test.o
    $(LN) $(LFLAGS) -o result test.o sub.o

test.o: test.f90 sub.o
    $(FC) $(FFLAGS) -o test.o test.f90

sub.o: sub.f90
    $(FC) $(FFLAGS) -o sub.o sub.f90

clean:
    rm result *.o*  *.mod *.e*

(gfortran can be used instead.) However, when I run this version, OpenMP causes a dramatic slowdown: it is even much slower than the single-threaded run (no OpenMP). What is happening here, and how can I fix it?
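One way to narrow this down is to time each call separately. The following is a minimal diagnostic sketch, not part of the original code: it inlines the same workload in a module named sub_timed (a hypothetical name) and uses omp_get_wtime, omp_get_thread_num and omp_get_max_threads from the standard omp_lib module to print how long each call takes on each thread. It assumes compilation with OpenMP enabled (e.g. ifort -qopenmp or gfortran -fopenmp).

```fortran
! Diagnostic sketch only: sub_timed and timing_probe are hypothetical names.
! The workload is the same as summation() in sub.f90.
module sub_timed
  implicit none
contains
  subroutine summation()
    double precision, allocatable :: theta(:,:,:), e(:,:,:)
    integer :: i, j

    allocate(theta(2000,50,5), e(2000,50,5))
    theta = 0.d0
    e = 0.d0

    do i = 1, 101
      do j = 1, 50
        theta = theta + 0.5d0*e
      end do
    end do

    deallocate(theta, e)
  end subroutine summation
end module sub_timed

program timing_probe
  use omp_lib
  use sub_timed
  implicit none
  integer :: cnt
  double precision :: t0, t1, t_start

  write(*,*) 'max threads: ', omp_get_max_threads()
  t_start = omp_get_wtime()

  ! t0 and t1 must be private so each thread times its own call.
  !$omp parallel do private(t0, t1)
  do cnt = 1, 8
    t0 = omp_get_wtime()
    call summation()
    t1 = omp_get_wtime()
    ! Serialize the output so lines from different threads do not interleave.
    !$omp critical
    write(*,'(a,i3,a,i3,a,f10.4,a)') 'call ', cnt, ' on thread ', &
         omp_get_thread_num(), ' took ', t1 - t0, ' s'
    !$omp end critical
  end do
  !$omp end parallel do

  write(*,*) 'total wall time: ', omp_get_wtime() - t_start
end program timing_probe
```

Running it with OMP_NUM_THREADS=1 and then with more threads shows whether the per-call time stays roughly constant (the calls scale) or grows with the thread count (the calls are competing for some shared resource). It does not by itself identify which resource that is.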

Source: https://stackoverflow.com/questions/47478641/slow-down-when-using-openmp-and-calling-subroutine-in-a-loop

Tags

performance

fortran

openmp

subroutine