I am interested in determining the musical key of an audio sample. How would (or could) an algorithm go about trying to approximate the key of a musical audio sample?
I worked on the problem of transcribing polyphonic CD recordings into scores for more than two years at university. The problem is notoriously hard. The first scientific papers related to the problem date back to the 1940s, and to this day there are no robust solutions for the general case.
All the basic assumptions you usually read about are not exactly right, and most of them are wrong enough that they become unusable for everything but very simple scenarios.
The frequencies of overtones are not exact integer multiples of the fundamental frequency - there are non-linear effects that make the high partials drift away from the expected frequency - and not by just a few hertz; it is not unusual to find the 7th partial where you expected the 6th.
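To get a feeling for the size of this drift, here is a minimal sketch that assumes the textbook stiff-string model, f_n = n * f0 * sqrt(1 + B * n^2). The model choice, the function names and the value of the inharmonicity coefficient B are illustrative assumptions, not measurements from any real instrument.

```python
import math

def partial_frequency(f0, n, b_coeff):
    """Frequency of the n-th partial of a stiff string (n = 1 is the fundamental)."""
    return n * f0 * math.sqrt(1.0 + b_coeff * n * n)

def deviation_cents(f0, n, b_coeff):
    """Distance in cents between the modelled partial and the ideal harmonic n * f0."""
    return 1200.0 * math.log2(partial_frequency(f0, n, b_coeff) / (n * f0))

if __name__ == "__main__":
    f0 = 110.0        # assumed fundamental (A2)
    b_coeff = 0.0005  # assumed inharmonicity coefficient, illustrative only
    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(partial_frequency(f0, n, b_coeff), 2),
              round(deviation_cents(f0, n, b_coeff), 1))
```

Under this model the deviation grows with the partial number, so a tracker that looks for partials only at exact multiples of the fundamental will increasingly miss the higher ones.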
Fourier transformations do not play nicely with audio analysis because the frequencies one is interested in are spaced logarithmically, while the Fourier transformation yields linearly spaced frequencies. At low frequencies you need high frequency resolution to separate neighboring pitches - but this yields bad time resolution, and you lose the ability to separate notes played in quick succession.
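The trade-off can be put into rough numbers. The sketch below assumes equal temperament and uses a deliberately simple criterion: the Fourier bin spacing (sample rate divided by window length) must not exceed the distance to the next semitone. Real windows smear energy across bins, so practical windows need to be longer still. The function names are made up for this illustration.

```python
import math

SEMITONE_RATIO = 2.0 ** (1.0 / 12.0)

def semitone_spacing_hz(f):
    """Distance in Hz from pitch f to the pitch one semitone above it."""
    return f * (SEMITONE_RATIO - 1.0)

def min_window_seconds(f):
    """Shortest window whose bin spacing (1 / duration) equals the semitone spacing."""
    return 1.0 / semitone_spacing_hz(f)

def min_window_samples(f, sample_rate):
    """The same window length expressed in samples."""
    return math.ceil(sample_rate * min_window_seconds(f))

if __name__ == "__main__":
    for f in (55.0, 110.0, 440.0, 880.0):
        print(f, round(semitone_spacing_hz(f), 2),
              round(min_window_seconds(f) * 1000.0, 1), "ms",
              min_window_samples(f, 44100), "samples")
```

A window of a few hundred milliseconds, as needed in the bass, easily spans several notes of a fast passage, while the same window is far longer than necessary in the treble.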
An audio recording (probably) does not contain all the information needed to reconstruct the score. A large part of our music perception happens in our ears and brain. That is why some of the most successful systems are expert systems with large knowledge repositories about the structure of (Western) music that rely on signal processing only to a small extent to extract information from the audio recording.
When I am back home I will look through the papers I have read, pick the 20 or 30 most relevant ones and add them here. I really suggest reading them before you decide to implement something - as stated before, most common assumptions are somewhat incorrect, and you really don't want to rediscover, while implementing and testing, all the things that have been found and analyzed over more than 50 years.
It's a hard problem, but it's a lot of fun, too. I would really like to hear what you tried and how well it worked.
For now you may have a look at the constant Q transform, the cepstrum and the Wigner(–Ville) distribution. There are also some good papers on how to extract the frequency from shifts in the phase of short-time Fourier spectra - this allows the use of very short window sizes (for high time resolution) because the frequency can be determined with a precision several thousand times finer than the frequency resolution of the underlying Fourier transformation.
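The phase idea can be sketched in a few lines. This is a minimal illustration in the style of a phase vocoder, not the method of any particular paper: it analyses two overlapping frames of a single stationary sinusoid, takes the strongest bin, and converts the phase advance between the frames into a frequency. The Hann window, the frame sizes and all function names are assumptions chosen for the example, and the naive DFT is only there to keep it free of dependencies.

```python
import cmath
import math

def hann(n_samples):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_samples)
            for i in range(n_samples)]

def dft_bin(frame, window, k):
    """Value of bin k of the windowed DFT of one frame (naive, O(N))."""
    n = len(frame)
    return sum(w * x * cmath.exp(-2j * math.pi * k * i / n)
               for i, (x, w) in enumerate(zip(frame, window)))

def wrap_phase(phi):
    """Map an angle to the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

def estimate_frequency(signal, sample_rate, n_fft, hop):
    """Estimate the frequency of the strongest component from two frames."""
    window = hann(n_fft)
    frame_a = signal[:n_fft]
    frame_b = signal[hop:hop + n_fft]
    # Strongest bin of the first frame (DC and Nyquist excluded).
    k = max(range(1, n_fft // 2),
            key=lambda b: abs(dft_bin(frame_a, window, b)))
    phase_a = cmath.phase(dft_bin(frame_a, window, k))
    phase_b = cmath.phase(dft_bin(frame_b, window, k))
    # Phase advance a sinusoid exactly at the bin centre would show over one hop.
    expected = 2.0 * math.pi * k * hop / n_fft
    deviation = wrap_phase(phase_b - phase_a - expected)
    return (k / n_fft + deviation / (2.0 * math.pi * hop)) * sample_rate

if __name__ == "__main__":
    rate = 8000
    tone = [math.sin(2.0 * math.pi * 440.0 * t / rate) for t in range(512)]
    print("bin spacing:", rate / 256, "Hz")
    print("estimate:", estimate_frequency(tone, rate, 256, 64), "Hz")
```

With a 256-sample window at 8 kHz the bins are 31.25 Hz apart, yet the estimate for a clean tone lands far closer to the true frequency than that. On real recordings, neighboring partials and noise disturb the phase, so the precision is considerably lower than in this idealized case.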
All these transformations fit the problem of audio processing much better than ordinary Fourier transformations. To improve the results of the basic transformations, have a look at the concept of energy reassignment.
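As an illustration of why the constant Q transform mentioned above fits musical pitch better, the following sketch lists its analysis bins under the usual textbook definition: geometrically spaced centre frequencies and a window length that shrinks as the frequency rises, so that every bin has the same ratio Q of centre frequency to bandwidth. The starting frequency, the number of bins per octave and the function name are assumptions for the example, and the transform itself is not computed here.

```python
import math

def constant_q_bins(f_min, bins_per_octave, n_bins, sample_rate):
    """Return Q and a list of (centre frequency in Hz, window length in samples)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bins = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)  # geometric spacing
        n_k = math.ceil(q * sample_rate / f_k)      # longer windows for low bins
        bins.append((f_k, n_k))
    return q, bins

if __name__ == "__main__":
    q, bins = constant_q_bins(55.0, 12, 49, 44100)
    print("Q =", round(q, 2))
    for f_k, n_k in bins[::12]:
        print(round(f_k, 1), "Hz ->", n_k, "samples")
```

With 12 bins per octave there is one bin per semitone, and the time resolution improves automatically toward higher pitches instead of being fixed by the lowest note.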