I guess it depends on what you are processing. If you are calculating the FFT over a large duration you might find that it does take a while depending on how many frequency points you are wanting. However, in most cases for audio it is considered non-stationary (that is the signals mean and variance changes to much over time), so taking one large FFT (Periodogram PSD estimate) is not an accurate representation. Alternatively you could use Short-time Fourier transform, whereby you break the signal up into smaller frames and calculate the FFT. The frame size varies depending on how quickly the statistics change, for speech it is usually 20-40ms, for music I assume it is slightly higher.
This method is good if you are sampling from the microphone, because it allows you to buffer each frame at a time, calculate the fft and give what the user feels is "real time" interaction. Because 20ms is quick, because we can't really perceive a time difference that small.
I developed a small bench mark to test the difference between FFTW and KissFFT c-libraries on a speech signal. Yes FFTW is highly optimised, but when you are taking only short-frames, updating the data for the user, and using only a small fft size, they are both very similar. Here is an example on how to implement the KissFFT libraries in Android using LibGdx by badlogic games. I implemented this library using overlapping frames in an Android App I developed a few months ago called Speech Enhancement for Android.