The best way is probably to write a few simple test cases, compile them with all optimization off, and step through them in a debugger at the assembly level: running one instruction at a time, you'll see where everything fits.
At least that's the way I learned it.
And if you find any case particularly challenging, post it on SO!