I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA).
The following instruction using [base+index] addressing
Older Intel processors without a uop cache can do this fusion, so maybe it is a drawback of the uop cache. I don't have time to test this right now, but I will add a test for uop fusion the next time I update my test scripts. Have you tried FMA instructions? They are the only instructions that allow three input dependencies in an unfused uop.