I am reading a raw wave stream coming from the microphone.
(This part works as I can send it to the speaker and get a nice echo.)
For simplicity, let's say I want
Let's say that typical DTMF frequencies lie between 200 Hz and 1000 Hz. Then you'd have to detect a signal from only 4 to 20 cycles, which corresponds to a window of about 20 ms. I don't think an FFT will get you anywhere, since its bins are spaced 1/(window duration) apart, so here you would only detect frequencies that are multiples of 50 Hz. This is a built-in feature of the FFT: increasing the number of samples within the same window (a higher sampling rate) will not solve your problem. You'll have to do something more clever.
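As a quick check of that arithmetic, here is a small sketch. The 20 ms window is an assumption (it is the value implied by 4 cycles at 200 Hz and 20 cycles at 1000 Hz), and the sample rates and the helper name `fft_bin_spacing_hz` are purely illustrative.

```python
def fft_bin_spacing_hz(sample_rate_hz, window_s):
    """Spacing between adjacent FFT bins for a window of the given duration."""
    n_samples = round(sample_rate_hz * window_s)  # samples in the window
    return sample_rate_hz / n_samples             # equals 1 / window duration


WINDOW_S = 0.020  # assumed 20 ms analysis window

# Number of cycles of each tone that fit in the window.
for freq_hz in (200, 1000):
    print(freq_hz, "Hz:", freq_hz * WINDOW_S, "cycles")

# The bin spacing depends on the window duration, not on the sample rate.
for rate_hz in (8000, 16000, 44100):
    print(rate_hz, "Hz sampling:", fft_bin_spacing_hz(rate_hz, WINDOW_S), "Hz per bin")
```

Under these assumptions the bin spacing comes out the same for every sample rate, which is why sampling faster does not buy any frequency resolution.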
Your best shot is a linear least-squares fit of your data to
h(t) = A cos (omega t) + B sin (omega t)
for a given omega (one of the DTMF frequencies). See this for details (in particular how to set a statistical significance level) and links to the literature.
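A minimal sketch of such a fit, in plain Python, might look like the following. It solves the 2x2 normal equations for A and B at one candidate frequency. The function name `fit_tone`, the 8000 Hz sample rate, the 20 ms window, and the test frequencies are all assumptions made for illustration. The returned "explained fraction" is only a crude goodness-of-fit measure, not the statistical significance test mentioned above.

```python
import math


def fit_tone(samples, sample_rate_hz, freq_hz):
    """Least-squares fit of samples to A*cos(omega*t) + B*sin(omega*t).

    Returns (A, B, explained), where explained is the fraction of the
    signal energy captured by the fitted sinusoid (between 0 and 1).
    """
    omega = 2.0 * math.pi * freq_hz
    scc = sss = scs = sxc = sxs = sxx = 0.0
    for n, x in enumerate(samples):
        t = n / sample_rate_hz
        c = math.cos(omega * t)
        s = math.sin(omega * t)
        scc += c * c
        sss += s * s
        scs += c * s
        sxc += x * c
        sxs += x * s
        sxx += x * x
    det = scc * sss - scs * scs
    if det == 0.0 or sxx == 0.0:
        raise ValueError("degenerate fit: empty/silent input or singular system")
    # Solve [[scc, scs], [scs, sss]] @ [A, B] = [sxc, sxs] by Cramer's rule.
    a = (sxc * sss - sxs * scs) / det
    b = (sxs * scc - sxc * scs) / det
    explained = (a * sxc + b * sxs) / sxx
    return a, b, explained


# Synthetic test signal: 20 ms of a 770 Hz tone sampled at 8000 Hz (assumed values).
RATE_HZ = 8000
N = 160
signal = [
    0.7 * math.cos(2 * math.pi * 770 * n / RATE_HZ)
    + 0.3 * math.sin(2 * math.pi * 770 * n / RATE_HZ)
    for n in range(N)
]

a_hit, b_hit, explained_hit = fit_tone(signal, RATE_HZ, 770)
a_miss, b_miss, explained_miss = fit_tone(signal, RATE_HZ, 697)
print("770 Hz:", a_hit, b_hit, explained_hit)
print("697 Hz:", a_miss, b_miss, explained_miss)
```

Running the fit once per candidate frequency and comparing the explained fractions gives a simple detector. A real decision threshold would have to come from the noise level and the significance analysis referenced above.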