Question
See further update below
I am observing quite high system CPU usage when running my Fortran code. The "user CPU usage" takes up about one core (the system is an Intel i7 with 4 cores/8 threads, running Linux), whilst system CPU is eating up about 2 cores (hence an overall CPU usage of about 75%). Can anyone explain where this is coming from and whether this is "normal" behaviour?
I compile the code with gfortran (optimization turned off -O0, though that part doesn't seem to matter) and link against BLAS, LAPACK and some (other) C-functions. My own code is not using any parallelization and neither does the linked code (as far as I can tell). At least I am not using any parallelized library versions.
The code itself assembles and solves finite element systems and does a lot (?) of allocation and intrinsic function calls (matmul, dot_product), though the overall RAM usage is pretty low (~200 MB). I don't know if this information is sufficient or useful, but I hope someone knows what is going on there.
Best regards, Ben
UPDATE: I think I have tracked down (part of) the problem to a call to DSYEV from LAPACK (it computes the eigenvalues of a real symmetric matrix A, in my case 3x3).
program test
    implicit none
    integer, parameter :: ndim=3
    real(8) :: tens(ndim,ndim)
    integer :: mm,nn
    real(8), dimension(ndim,ndim) :: eigvec
    real(8), dimension(ndim) :: eigval
    character, parameter :: jobz='v'   ! Flags calculation of eigenvectors
    character, parameter :: uplo='u'   ! Flags upper triangular
    integer, parameter :: lwork=102    ! Length of work array
    real(8), dimension(lwork) :: work  ! Work array
    integer :: info

    tens(1,:) = [1.d0, 2.d0, 3.d0]
    tens(2,:) = [2.d0, 5.d0, 1.d0]
    tens(3,:) = [3.d0, 1.d0, 1.d0]

    do mm=1,5000000
        eigvec=tens
        ! Call DSYEV
        call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
    enddo

    write(*,*) eigvec
    write(*,*) int(work(1))
endprogram test
The compiling and linking is done with
gfortran test.f90 -o test -llapack
This program gives me very high %sys CPU usage. Can anyone verify this (obviously LAPACK is necessary to run the code)? Is this "normal" behaviour, or is something wrong with my code/system/libraries?
UPDATE 2: Encouraged by @roygvib's comment, I ran the code on another system, where the high sys CPU usage could not be reproduced. Comparing the two systems, I can't find where the difference comes from. Both run the same OS version (Ubuntu Linux), the same gfortran version (4.8), and the same kernel, LAPACK and BLAS versions. The "major" difference is the processor: an i7-4770 on the buggy system and an i7-870 on the other. Running the test code on the buggy system gives about 16 s user and 28 s sys time; on the i7-870 it is 16 s user and 0 s sys. Running four instances of the code in parallel gives an overall time per process of about 18 s on the other system and 44 s on the buggy system. Any ideas what else I could look for?
UPDATE 3: I think we are getting closer. Building the test program on the other system with static links to the LAPACK and BLAS libraries,
gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -Wl,--allow-multiple-definition
and running that code on the buggy system gives a %sys time of about 0 (as desired). Conversely, building the test program with static links to LAPACK and BLAS on the buggy system and running it on the other system produces high %sys CPU usage as well! So the libraries evidently differ, right? The static build on the buggy system is about 18 MB(!) in size, versus 100 KB on the other system. Additionally, I have to include the
-Wl,--allow-multiple-definition
option only on the other system (otherwise the linker complains about multiple definitions of xerbla), whilst on the buggy system I have to link explicitly against libpthread:
gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -lpthread -o test
The interesting thing is that
apt-cache policy liblapack*
returns the same versions and repo destinations for both systems (same goes for libblas*). Any further ideas? Maybe there is some other command to check library version that I don't know of?
Answer 1:
My interpretation of the slowdown:
A threaded (probably OpenMP) version of LAPACK and BLAS was used. These try to launch several threads to solve the linear algebra problem in parallel, which often speeds up the computation.
However, in this case
do mm=1,5000000
    eigvec=tens
    call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
enddo
the library is called a huge number of times for a very small problem (a 3x3 matrix). Such a problem cannot be solved efficiently in parallel because the matrix is too small: the overhead of synchronizing the threads dominates the solution time. The synchronization (if not the thread creation itself) is done 5000000 times!
Remedies:
- Use a non-threaded BLAS and LAPACK.
- If the parallelization is done using OpenMP, set
  OMP_NUM_THREADS=1
  which means that only one thread is used.
- Do not use LAPACK at all, because specialized algorithms are available for the special 3x3 case: https://en.wikipedia.org/wiki/Eigenvalue_algorithm#3.C3.973_matrices
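As a rough illustration of the LAPACK-free route for 3x3 matrices, the trigonometric closed form described on the linked Wikipedia page can be sketched in Fortran as follows. The module name `sym3_eig` and the routine names `eig_sym3` and `sort3` are made up for this sketch. It returns eigenvalues only (the question's code requests eigenvectors with jobz='v', which this does not provide), and it may be less accurate than DSYEV when eigenvalues are nearly equal.

```fortran
module sym3_eig
  implicit none
  private
  public :: dp, eig_sym3
  integer, parameter :: dp = kind(1.d0)

contains

  ! Eigenvalues of a real symmetric 3x3 matrix, in ascending order.
  ! Only the upper triangle of a is read. Eigenvectors are NOT computed.
  ! (Hypothetical helper for illustration; not part of LAPACK.)
  subroutine eig_sym3(a, w)
    real(dp), intent(in)  :: a(3,3)
    real(dp), intent(out) :: w(3)
    real(dp), parameter :: pi = 3.141592653589793238_dp
    real(dp) :: p1, p2, p, q, r, phi
    real(dp) :: b11, b22, b33, b12, b13, b23

    p1 = a(1,2)**2 + a(1,3)**2 + a(2,3)**2
    if (p1 == 0.0_dp) then
      ! Already diagonal: the eigenvalues are the diagonal entries.
      w = [a(1,1), a(2,2), a(3,3)]
      call sort3(w)
      return
    end if

    q  = (a(1,1) + a(2,2) + a(3,3)) / 3.0_dp
    p2 = (a(1,1)-q)**2 + (a(2,2)-q)**2 + (a(3,3)-q)**2 + 2.0_dp*p1
    p  = sqrt(p2 / 6.0_dp)

    ! Entries of B = (A - q*I) / p (upper triangle only)
    b11 = (a(1,1)-q)/p
    b22 = (a(2,2)-q)/p
    b33 = (a(3,3)-q)/p
    b12 = a(1,2)/p
    b13 = a(1,3)/p
    b23 = a(2,3)/p

    ! r = det(B)/2, clamped to [-1,1] to guard against rounding error
    r = 0.5_dp * ( b11*(b22*b33 - b23*b23) &
                 - b12*(b12*b33 - b23*b13) &
                 + b13*(b12*b23 - b22*b13) )
    r = max(-1.0_dp, min(1.0_dp, r))
    phi = acos(r) / 3.0_dp

    w(3) = q + 2.0_dp*p*cos(phi)                     ! largest
    w(1) = q + 2.0_dp*p*cos(phi + 2.0_dp*pi/3.0_dp)  ! smallest
    w(2) = 3.0_dp*q - w(1) - w(3)                    ! from the trace
  end subroutine eig_sym3

  ! Sort three values into ascending order with at most three swaps.
  subroutine sort3(w)
    real(dp), intent(inout) :: w(3)
    if (w(1) > w(2)) w(1:2) = w(2:1:-1)
    if (w(2) > w(3)) w(2:3) = w(3:2:-1)
    if (w(1) > w(2)) w(1:2) = w(2:1:-1)
  end subroutine sort3

end module sym3_eig
```

In the question's loop, `call eig_sym3(tens, eigval)` would take the place of the `dsyev` call when only eigenvalues are needed. No threads are involved, so there is no synchronization overhead per call. For the matrix in the question the three returned values sum to 7, the trace.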
Source: https://stackoverflow.com/questions/35926940/fortran-lapack-high-cpu-sys-usage-with-dsyev-no-parallelization-normal