FFTW vs Matlab FFT

旧时难觅i 2020-12-14 14:44

I posted this on MATLAB Central but didn't get any responses, so I figured I'd repost here.

I recently wrote a simple routine in Matlab that uses an FFT in a for-l

4 Answers
  • 2020-12-14 15:30

    EDIT: @wakjah's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is therefore not accurate, but it can still apply if FFTW's Guru interface is not used, which is the default case, so beware!

    First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.

    In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).

    In FFTW, the real and imaginary parts are interleaved as double[2] by default, i.e. X[i][0] is the real part and X[i][1] is the imaginary part.

    Thus, to use the FFTW library in mex files, one cannot use the Matlab array directly. One must first allocate new memory, then pack the input from Matlab into FFTW format, and finally unpack the output from FFTW into Matlab format. First, allocate:

    /* Scratch buffers in FFTW's default interleaved layout. */
    fftw_complex *X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    

    then pack the input:

    for (size_t i=0; i<N; ++i) {
        X[i][0] = Xre[i];
        X[i][1] = Xim[i];
    }
    

    then, once the transform has run, unpack the output:

    for (size_t i=0; i<N; ++i) {
        Yre[i] = Y[i][0];
        Yim[i] = Y[i][1];
    }
    

    Hence, this requires 2 extra memory allocations plus 4N reads and 4N writes (2N of each for packing, 2N of each for unpacking). This takes a toll on speed for large problems.
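    The packing and unpacking steps above can be sketched in plain C as follows. This is only an illustration: complex_pair is a hypothetical stand-in declared here with the same double[2] layout as fftw_complex so that the sketch compiles without fftw3.h, and pack_split, unpack_split and scratch_bytes are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with the same layout as fftw_complex (double[2]). */
typedef double complex_pair[2];

/* Split (Matlab-style) -> interleaved (FFTW default): 2N reads, 2N writes. */
void pack_split(const double *re, const double *im, complex_pair *x, size_t n)
{
    assert(re != NULL && im != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        x[i][0] = re[i];   /* real part */
        x[i][1] = im[i];   /* imaginary part */
    }
}

/* Interleaved -> split: another 2N reads and 2N writes. */
void unpack_split(complex_pair *y, double *re, double *im, size_t n)
{
    assert(y != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < n; ++i) {
        re[i] = y[i][0];
        im[i] = y[i][1];
    }
}

/* Bytes needed for the two interleaved scratch buffers X and Y. */
size_t scratch_bytes(size_t n)
{
    return 2 * n * sizeof(complex_pair);
}
```

    For the N=500000 mentioned elsewhere in this thread, and assuming 8-byte doubles, scratch_bytes(500000) comes to 16,000,000 bytes, i.e. about 8 MB per interleaved buffer on top of the Matlab arrays.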

    I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.

    In this scenario, one can allocate only X and run FFTW in place (fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since the result gets copied into the Matlab vectors Yre and Yim anyway, unless the application requires or benefits from keeping X and Y separate.

    EDIT: Watching memory consumption in real time while running Matlab's fft2() and my code based on the fftw3 library shows that Matlab allocates only one additional complex array (the output), whereas my code needs two such arrays (the fftw_complex* buffer plus the Matlab output). An in-place conversion between the Matlab and FFTW formats is not possible because Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read and write the data in the Matlab format.

    One other optimization for multiple calls is to allocate the buffers persistently (using mexMakeMemoryPersistent()). I'm not sure whether the Matlab implementation does this as well.

    Cheers.

    p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. With FFTW's format you'd have to step through memory with a stride of 2 (i += 2 over the underlying doubles).
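    To make the stride remark concrete, here is a small sketch in plain C that sums the real parts under each layout. The names complex_pair, sum_real_split and sum_real_interleaved are hypothetical and only serve this illustration; complex_pair mirrors the double[2] layout of fftw_complex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with the same layout as fftw_complex (double[2]). */
typedef double complex_pair[2];

/* Split layout: the real parts are contiguous, so the loop has unit stride. */
double sum_real_split(const double *re, size_t n)
{
    assert(re != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += re[i];
    return s;
}

/* Interleaved layout: consecutive real parts sit two doubles apart,
   so every other double that is loaded (the imaginary part) goes unused. */
double sum_real_interleaved(complex_pair *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i][0];
    return s;
}
```

    Both functions return the same value for the same data. Any speed difference depends on the compiler, the cache, and the problem size, so it would have to be measured rather than assumed.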

  • 2020-12-14 15:32

    This is a classic performance gain thanks to low-level, architecture-specific optimization.

    Matlab uses the FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at assembly level) by Intel for Intel processors. Even on AMD processors it seems to give nice performance boosts.

    FFTW seems like an ordinary C library that is not as optimized. Hence the performance gain from using MKL.

  • 2020-12-14 15:34

    A few observations rather than a definitive answer, since I do not know any of the specifics of the MATLAB FFT implementation:

    • Based on the code you have, I can see two explanations for the speed difference:
      • the speed difference is explained by differences in levels of optimization of the FFT
      • the while loop in MATLAB is executed a significantly smaller number of times

    I will assume you already looked into the second issue and that the number of iterations is comparable. (If it isn't, this is most likely due to some accuracy issue and is worth further investigation.)

    Now, regarding FFT speed comparison:

    • Yes, the theory is that FFTW is faster than other high-level FFT implementations, but that is only relevant as long as you compare apples to apples. Here you are comparing implementations a level further down, at the assembly level, where not only the selection of the algorithm but also its actual optimization for a specific processor, by software developers with varying skills, comes into play.
    • I have optimized or reviewed optimized FFTs in assembly on many processors over the years (I was in the benchmarking industry), and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not taken latencies, etc.) and that make differences as important as the selection of the algorithm.
    • With N=500000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get pretty specific to the platform you run your code on. How well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flows and what optimizations a software developer may have used to bring data in and out of memory efficiently.
    • Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and still is) honing its optimization, as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
  • 2020-12-14 15:37

    I have found the following comment on the MathWorks website [1]:

    Note on large powers of 2: For FFT dimensions that are powers of 2, between 2^14 and 2^22, MATLAB software uses special preloaded information in its internal database to optimize the FFT computation. No tuning is performed when the dimension of the FFT is a power of 2, unless you clear the database using the command fftw('wisdom', []).

    Although it relates to powers of 2, it may hint that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.

    [1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)
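    As a quick way to check whether a given FFT length falls in the range quoted above, here is a small sketch in plain C. The function names are hypothetical, and the 2^14 to 2^22 bounds are taken only from the quoted note, not from any knowledge of MATLAB internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when n has exactly one bit set. */
bool is_pow2(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Exponent k such that n == 2^k; n must be a power of 2. */
unsigned log2_exact(size_t n)
{
    assert(is_pow2(n));
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* True when n is a power of 2 between 2^14 and 2^22 inclusive,
   the range named in the quoted documentation note. */
bool in_preloaded_range(size_t n)
{
    if (!is_pow2(n))
        return false;
    unsigned k = log2_exact(n);
    return k >= 14 && k <= 22;
}
```

    With this check, 65536 (2^16) is inside the range, while the N=500000 mentioned in another answer is not a power of 2 at all, so the quoted note would not cover that length directly.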
