Intel VTune is good and is non-instrumenting. We evaluated a whole bunch of profilers for Windows, and this was the best for working with driver code (though it does unmanaged user level code as well). A particular strength is that it reads all the Intel processor performance counters, so you can get a good understanding of why your code is running slowly, and it was useful for putting prefetch instructions into our code and sorting out data layout to work well with the cache lines, and the way cache lines get invalidated in multi core systems.
It is commercial, and I have to say it isn't the easiest UI in the world.