You would need to implement your own 64-bit multiplication routine using 32-bit multiply operations. It's probably not going to be any more efficient than just doing this with scalar code, though, particularly as it will take a lot of vector shuffling to line up the operands for all the required partial products.
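To make the decomposition concrete, here is a minimal scalar sketch of the arithmetic that each 64-bit lane would have to perform. It is not vector code, and the function name `mul64_from_mul32` is made up for illustration. It assumes you only want the low 64 bits of the product and that the only multiply available is 32x32->64 unsigned.

```c
#include <stdint.h>

/* Hypothetical helper: low 64 bits of a * b, built only from
 * multiplies whose operands both fit in 32 bits. */
uint64_t mul64_from_mul32(uint64_t a, uint64_t b)
{
    /* Split each operand into 32-bit halves (held in 64-bit variables
     * so the products below are computed at 64-bit width). */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* lo * lo contributes a full 64-bit partial product. */
    uint64_t lo_lo = a_lo * b_lo;

    /* The two cross terms are shifted left by 32, so only their low
     * 32 bits survive; wrap-around in the sum is harmless. */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;

    /* hi * hi would land entirely above bit 63, so it is dropped. */
    return lo_lo + (cross << 32);
}
```

That is three multiplies, two adds and a shift per element. In a vector version the halves also have to be moved into the positions the 32-bit multiply instruction reads from, which is where the shuffling cost comes in.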