This is exactly what I'm doing here as my last year project :) except one thing that my project is about tracking the pitch of human singing voice (and I don't have the robot to play the tune)
The quickest way I can think of is to utilize BASS library. It contains ready-to-use function that can give you FFT data from default recording device. Take a look at "livespec" code example that comes with BASS.
By the way, raw FFT data will not enough to determine fundamental frequency. You need algorithm such as Harmonic Product Spectrum to get the F0.
Another consideration is the audio source. If you are going to do FFT and apply Harmonic Product Spectrum on it. You will need to make sure the input has only one audio source. If it contains multiple sources such as in modern songs there will be to many frequencies to consider.
Harmonic Product Spectrum Theory
If the input signal is a musical note,
then its spectrum should consist of a
series of peaks, corresponding to
fundamental frequency with harmonic
components at integer multiples of the
fundamental frequency. Hence when we
compress the spectrum a number of
times (downsampling), and compare it
with the original spectrum, we can see
that the strongest harmonic peaks line
up. The first peak in the original
spectrum coincides with the second
peak in the spectrum compressed by a
factor of two, which coincides with
the third peak in the spectrum
compressed by a factor of three.
Hence, when the various spectrums are
multiplied together, the result will
form clear peak at the fundamental
frequency.
Method
First, we divide the input signal into
segments by applying a Hanning window,
where the window size and hop size are
given as an input. For each window,
we utilize the Short-Time Fourier
Transform to convert the input signal
from the time domain to the frequency
domain. Once the input is in the
frequency domain, we apply the
Harmonic Product Spectrum technique to
each window.
The HPS involves two steps:
downsampling and multiplication. To
downsample, we compressed the spectrum
twice in each window by resampling:
the first time, we compress the
original spectrum by two and the
second time, by three. Once this is
completed, we multiply the three
spectra together and find the
frequency that corresponds to the peak
(maximum value). This particular
frequency represents the fundamental
frequency of that particular window.
Limitations of the HPS method
Some nice features of this method
include: it is computationally
inexpensive, reasonably resistant to
additive and multiplicative noise, and
adjustable to different kind of
inputs. For instance, we could change
the number of compressed spectra to
use, and we could replace the spectral
multiplication with a spectral
addition. However, since human pitch
perception is basically logarithmic,
this means that low pitches may be
tracked less accurately than high
pitches.
Another severe shortfall of the HPS
method is that it its resolution is
only as good as the length of the FFT
used to calculate the spectrum. If we
perform a short and fast FFT, we are
limited in the number of discrete
frequencies we can consider. In order
to gain a higher resolution in our
output (and therefore see less
graininess in our pitch output), we
need to take a longer FFT which
requires more time.
from: http://cnx.org/content/m11714/latest/