Fortran LAPACK: high CPU %sys usage with DSYEV - no parallelization - normal?

Question

See further update below

I am observing quite high system CPU usage when running my Fortran code. User CPU usage takes about one core (the system is an Intel i7 with 4 cores / 8 threads, running Linux), whilst system CPU usage eats up about two cores (hence overall CPU usage of about 75%). Can anyone explain to me where this is coming from and whether this is "normal" behaviour?

I compile the code with gfortran (optimization turned off -O0, though that part doesn't seem to matter) and link against BLAS, LAPACK and some (other) C-functions. My own code is not using any parallelization and neither does the linked code (as far as I can tell). At least I am not using any parallelized library versions.

The code itself assembles and solves finite element systems and does a lot (?) of allocation and intrinsic function calls (matmul, dot_product), though the overall RAM usage is pretty low (~200 MB). I don't know if this information is sufficient or useful, but I hope someone knows what is going on here.

Best regards, Ben

UPDATE I think I have tracked down (part of) the problem to a call to DSYEV from LAPACK (which computes the eigenvalues of a real symmetric matrix A, in my case 3x3).

program test

implicit none

integer,parameter :: ndim=3
real(8) :: tens(ndim,ndim)

integer :: mm,nn
real(8), dimension(ndim,ndim):: eigvec
real(8), dimension(ndim)   :: eigval

character, parameter    :: jobz='v'  ! Flags calculation of eigenvectors
character, parameter    :: uplo='u'  ! Flags upper triangular 
integer, parameter      :: lwork=102   ! Length of work array
real(8), dimension(lwork)  :: work      ! Work array
integer :: info   

tens(1,:) = [1.d0, 2.d0, 3.d0]
tens(2,:) = [2.d0, 5.d0, 1.d0]
tens(3,:) = [3.d0, 1.d0, 1.d0]   

do mm=1,5000000    
    eigvec=tens
   ! Call DSYEV
   call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
enddo

write(*,*) eigvec
write(*,*) int(work(1))

endprogram test

The compiling and linking is done with

gfortran test.f90 -o test -llapack

This program gives me very high %sys CPU usage. Can anyone verify this (obviously LAPACK is necessary to run the code)? Is this "normal" behaviour, or is something wrong with my code/system/libraries?

UPDATE 2 Encouraged by @roygvib's comment I ran the code on another system. On the second system, the high CPU sys usage could not be reproduced. Comparing the two systems I can't seem to find where this is coming from. Both run the same OS version (Linux Ubuntu), same gfortran version (4.8), Kernel Version, LAPACK and BLAS. "Major" difference: the processor is an i7-4770 on the buggy system and an i7-870 on the other. Running the test code on the buggy one is giving me about %user 16s and %sys 28s. On the i7-870 it is %user 16s %sys 0s. Running the code four times (parallel) gives me an overall timing for each process of about 18s on the other system and 44s on the buggy system. Any ideas what else I could look for?

UPDATE 3 I think we are getting closer: Building the test program on the other system with a static link to the LAPACK and BLAS library,

gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -Wl,--allow-multiple-definition

and running that code on the buggy system gives me a %sys time of about 0 (as desired). On the other hand, building the test program with static links to LAPACK and BLAS on the buggy system and running it on the other system returns high %sys CPU usage as well! So obviously, the libraries seem to differ, right? Building the static version on the buggy system results in a file size of about 18 MB(!), on the other system 100 KB. Additionally, I have to include the

-Wl,--allow-multiple-definition

option only on the other system (otherwise the linker complains about multiple definitions of xerbla), whilst on the buggy system I have to (explicitly) link against libpthread

gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -lpthread -o test

The interesting thing is that

apt-cache policy liblapack*

returns the same versions and repo destinations for both systems (same goes for libblas*). Any further ideas? Maybe there is some other command to check library version that I don't know of?

Answer 1:

My interpretation of the slowdown:

A threaded (probably OpenMP) version of LAPACK and BLAS was used. Such versions try to launch several threads to solve the linear algebra problem in parallel, which often speeds up the computation.

However in this case

do mm=1,5000000    
   eigvec=tens
   call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
enddo

This calls the library a huge number of times for a very small problem (a 3x3 matrix). Such a problem cannot be solved efficiently in parallel; the matrix is too small. The overhead of synchronizing the threads dominates the solution time, and that synchronization (if not even thread creation) happens 5000000 times!
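One way to check this interpretation is to look at which BLAS/LAPACK implementation the binary actually loads. The following is a minimal sketch, assuming the dynamically linked executable from the question is ./test and that the system uses the Debian/Ubuntu update-alternatives mechanism; the helper name linalg_libs is made up for illustration.

```shell
# linalg_libs is a hypothetical helper: print the linear algebra and
# threading libraries that the dynamic loader resolves for a binary.
linalg_libs() {
    if [ ! -f "$1" ] || [ ! -x "$1" ]; then
        echo "binary not found: $1"
        return 1
    fi
    # Entries such as libopenblas, libatlas, libgomp or libpthread hint at
    # a threaded build; a plain liblapack/libblas pair usually does not.
    ldd "$1" | grep -i -E 'lapack|blas|atlas|gomp|pthread' \
        || echo "no BLAS/LAPACK shared libraries listed (static link?)"
}

# Assumption: ./test is the executable built in the question.
linalg_libs ./test || true

# On Debian/Ubuntu, libblas.so.3 and liblapack.so.3 can be switched between
# implementations via alternatives; this lists the current selections.
if command -v update-alternatives >/dev/null 2>&1; then
    update-alternatives --get-selections | grep -i -E 'blas|lapack' \
        || echo "no BLAS/LAPACK alternatives registered"
fi
```

If the two machines resolve libblas.so.3 or liblapack.so.3 to different implementations, that might also explain why the package versions look identical while the behaviour differs.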

Remedies:

Use a non-threaded BLAS and LAPACK.
If the parallelization is done using OpenMP, set OMP_NUM_THREADS=1, which means only one thread is used.
Do not use LAPACK at all, because for the special case of 3x3 matrices there are specialized algorithms available: https://en.wikipedia.org/wiki/Eigenvalue_algorithm#3.C3.973_matrices
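The second remedy can be tried without rebuilding anything. This is a minimal sketch, assuming the executable is ./test; OMP_NUM_THREADS is the standard OpenMP variable, while OPENBLAS_NUM_THREADS only matters if the library in use happens to be OpenBLAS, which the question does not confirm.

```shell
# Restrict threaded libraries to a single thread for this shell session.
export OMP_NUM_THREADS=1        # read by OpenMP runtimes
export OPENBLAS_NUM_THREADS=1   # read by OpenBLAS; ignored by other libraries

# Assumption: ./test is the executable built in the question.
# Prefixing the run with "time" shows the user/sys split for comparison.
if [ -f ./test ] && [ -x ./test ]; then
    time ./test
fi
```

If the sys time drops to roughly zero with these settings, the threaded library is the likely cause of the overhead.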

Source: https://stackoverflow.com/questions/35926940/fortran-lapack-high-cpu-sys-usage-with-dsyev-no-parallelization-normal

Tags

Linux

fortran

cpu-usage

lapack

blas