It is possible to make return_address an array of dwords and let each thread access it at a unique index computed by an injective (one-to-one) function of its unique identifier.
This change makes nrz's accepted answer work for multithreaded code as well!
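
To illustrate the idea, here is a minimal sketch in C rather than assembly. It assumes POSIX threads and that thread identifiers are handed out densely as 0..MAX_THREADS-1, so the identity mapping is already injective. The names MAX_THREADS, slot_for, worker and run_demo are hypothetical, and the value stored in each slot is only a placeholder for a real return address.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 8  /* hypothetical upper bound on concurrent threads */

/* One dword slot per thread instead of a single shared variable. */
uint32_t return_address[MAX_THREADS];

/* Hypothetical injective mapping from thread identifier to slot index.
   Identifiers are assumed to be 0..MAX_THREADS-1, so identity suffices. */
unsigned slot_for(unsigned thread_id) {
    assert(thread_id < MAX_THREADS);
    return thread_id;
}

/* Each thread touches only its own slot, so no locking is required. */
static void *worker(void *arg) {
    unsigned id = (unsigned)(uintptr_t)arg;
    /* Placeholder value standing in for a saved return address. */
    return_address[slot_for(id)] = 0x1000u + id;
    return NULL;
}

/* Starts the threads, waits for them, and returns how many slots
   hold the value written by their owning thread (-1 on failure). */
int run_demo(void) {
    pthread_t threads[MAX_THREADS];
    unsigned created = 0;
    int ok = 0;

    while (created < MAX_THREADS) {
        if (pthread_create(&threads[created], NULL, worker,
                           (void *)(uintptr_t)created) != 0)
            break;
        created++;
    }
    for (unsigned i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    if (created < MAX_THREADS)
        return -1;

    for (unsigned i = 0; i < MAX_THREADS; i++)
        if (return_address[i] == 0x1000u + i)
            ok++;
    return ok;
}
```

Calling run_demo() should report that every slot holds the value written by its own thread. The array needs at least as many entries as there are threads, and if the identifiers are not dense (for example, OS thread IDs), the mapping would have to be replaced by some other collision-free scheme.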