I will soon be tasked with doing a proper memory profile of code that is written in C/C++ and uses CUDA to take advantage of GPU processing.
My initial thou
You could try the heap profiler from Google's PerfTools:
http://google-perftools.googlecode.com/svn/trunk/doc/heapprofile.html
It's very lightweight; it literally replaces malloc/calloc/realloc/free to add instrumentation code. It's primarily tested on Linux platforms.
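To give a feel for what "replacing malloc/free to add instrumentation" means, here is a minimal sketch of the idea. It is not PerfTools code: `counting_malloc`, `counting_free`, `HeapStats` and `g_stats` are hypothetical names made up for this illustration, and the sketch is single-threaded and only counts bytes, whereas a real heap profiler also records the call stack of each allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical bookkeeping for this sketch (not part of any real library).
struct HeapStats {
    std::size_t live_bytes = 0;   // bytes currently allocated and not yet freed
    std::size_t peak_bytes = 0;   // highest value live_bytes has reached
    std::size_t alloc_calls = 0;  // number of successful allocations
};

static HeapStats g_stats;

// Space reserved in front of every block to remember its size. Using the
// maximum fundamental alignment keeps the pointer handed to the caller aligned.
constexpr std::size_t kHeader =
    alignof(std::max_align_t) > sizeof(std::size_t) ? alignof(std::max_align_t)
                                                    : sizeof(std::size_t);

// Wrapper around malloc: allocate a little extra, store the requested size
// in the header, update the counters, return the address after the header.
void* counting_malloc(std::size_t size) {
    void* raw = std::malloc(kHeader + size);
    if (raw == nullptr) {
        return nullptr;
    }
    *static_cast<std::size_t*>(raw) = size;
    g_stats.live_bytes += size;
    g_stats.alloc_calls += 1;
    if (g_stats.live_bytes > g_stats.peak_bytes) {
        g_stats.peak_bytes = g_stats.live_bytes;
    }
    return static_cast<char*>(raw) + kHeader;
}

// Wrapper around free: step back to the header, read the stored size,
// update the counters, release the whole block.
void counting_free(void* ptr) {
    if (ptr == nullptr) {
        return;
    }
    void* raw = static_cast<char*>(ptr) - kHeader;
    const std::size_t size = *static_cast<std::size_t*>(raw);
    assert(g_stats.live_bytes >= size);  // sanity check on the bookkeeping
    g_stats.live_bytes -= size;
    std::free(raw);
}
```

In this sketch you have to call the wrappers explicitly. A profiler that replaces the allocator does the same kind of bookkeeping behind the standard function names, so existing code does not need to change.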
If you have compiled with debugging symbols, and your third-party libraries come with debug-version variants, PerfTools should do very well. If you don't have debug-symbol libraries, build your own code with debug symbols anyway. That gives you detailed numbers for your code, and whatever is left over can be attributed to the third-party libraries.