The theoretical maximum memory bandwidth for a Core 2 processor with dual-channel DDR3 memory is impressive: according to the Wikipedia article on the architecture, 10+
You could write your own. Alternatively, try using Intel's optimising compiler to target the architecture directly.
Intel also produce a profiler called VTune (compiler and language independent) for optimising applications.
Here's an article on optimising a game engine.