Your method might work for synthetic music in which the notes are synchronized to your FFT frame timing and length, and which uses only note sounds whose complete spectrum is compatible with your HPS pitch estimator. Neither condition holds for most real-world music.
For the more general case, automatic music transcription still seems to be a research problem, with no simple five-step solution. Pitch is a human psychoacoustic phenomenon. People will hear notes that may or may not be present in the local spectrum. The HPS pitch estimation algorithm is much more reliable than picking the FFT peak, but it can still fail for many kinds of musical sounds. Also, the FFT of any frame that crosses a note boundary or a transient may contain no clear single pitch to estimate.
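To illustrate the difference between the FFT peak and HPS, here is a minimal sketch in Python with NumPy. It is not your code or a reference implementation: the function names `hps_pitch` and `fft_peak_pitch`, the Hann window, the choice of 4 harmonics, and the synthetic test tone are all assumptions made for the example. The tone is deliberately the easy case described above: a steady note whose harmonics fall exactly on FFT bins, with a second harmonic that is louder than the fundamental.

```python
import numpy as np

def magnitude_spectrum(frame):
    # Hann-window the frame to reduce leakage, then take the magnitude FFT.
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame))))

def fft_peak_pitch(frame, sample_rate):
    # Naive estimate: frequency of the strongest bin (DC bin skipped).
    spectrum = magnitude_spectrum(frame)
    peak_bin = np.argmax(spectrum[1:]) + 1
    return peak_bin * sample_rate / len(frame)

def hps_pitch(frame, sample_rate, num_harmonics=4):
    # Harmonic Product Spectrum: multiply the spectrum by copies of itself
    # decimated by 2, 3, ..., so that bins whose harmonics are all strong
    # reinforce each other.
    spectrum = magnitude_spectrum(frame)
    hps = spectrum.copy()
    for h in range(2, num_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated
    # Only search bins for which every decimated copy contributed.
    limit = len(spectrum) // num_harmonics
    peak_bin = np.argmax(hps[1:limit]) + 1
    return peak_bin * sample_rate / len(frame)

# Assumed test tone: 220 Hz fundamental, harmonics at 440, 660, 880 Hz,
# with the 2nd harmonic louder than the fundamental.
sample_rate = 8000
t = np.arange(4000) / sample_rate  # 0.5 s frame, 2 Hz bin spacing
amplitudes = [0.5, 1.0, 0.4, 0.3]
tone = sum(a * np.sin(2 * np.pi * 220 * (k + 1) * t)
           for k, a in enumerate(amplitudes))

print("FFT peak:", fft_peak_pitch(tone, sample_rate))
print("HPS:     ", hps_pitch(tone, sample_rate))
```

On this tone the FFT peak should land on the louder second harmonic (about 440 Hz), while HPS should recover the 220 Hz fundamental. The sketch only covers the favorable case, though: if the frame straddles two notes, or the sound has weak, missing, or inharmonic partials, the product can peak at the wrong bin or at no meaningful bin at all.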