I will soon be tasked with doing a proper memory profile of code that is written in C/C++ and uses CUDA to take advantage of GPU processing.
My initial thou
If you don't want to use an "external" tool, you can try to use tools like:
mtrace
It installs handlers for malloc, realloc and free and logs every operation to a file. See the Wikipedia article I linked for code usage examples.
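As a rough illustration of the mtrace approach, here is a minimal sketch for glibc. The helper name run_traced_allocations and the log path are hypothetical, chosen only for this example. mtrace() takes the log file name from the MALLOC_TRACE environment variable, which the sketch sets itself before starting the trace.

```c
#define _GNU_SOURCE
#include <mcheck.h>   /* mtrace(), muntrace(): glibc-specific */
#include <stdlib.h>

/* Hypothetical helper for illustration: traces a short allocation
 * sequence into log_path. Returns 0 on success, -1 on failure. */
int run_traced_allocations(const char *log_path)
{
    /* mtrace() looks up MALLOC_TRACE when it is called, so set it first. */
    if (setenv("MALLOC_TRACE", log_path, 1) != 0)
        return -1;

    mtrace();                   /* start logging malloc/realloc/free */

    char *kept = malloc(64);    /* freed below: a matched pair in the log */
    char *leaked = malloc(32);  /* intentionally never freed */
    if (kept == NULL || leaked == NULL) {
        muntrace();
        free(kept);
        free(leaked);
        return -1;
    }
    kept[0] = leaked[0] = '\0'; /* touch both buffers so they are really used */
    free(kept);

    muntrace();                 /* stop logging */
    return 0;
}
```

You would call this from main with something like run_traced_allocations("/tmp/mtrace_demo.log") and then feed the log to the mtrace script that ships with glibc to list allocations that were never freed. As far as I know, on recent glibc versions (2.34 and later) the tracing only takes effect if the program is started with libc_malloc_debug.so preloaded, so check the man page for your system.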
dmalloc
It's a library you link into your code; it can find memory leaks, off-by-one errors and use of invalid addresses. You can also disable it at compile time with -DDMALLOC_DISABLE.
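For dmalloc, a source file is typically prepared along the lines of the sketch below. This assumes the include pattern suggested in dmalloc's own documentation, where dmalloc.h is pulled in only when you build with -DDMALLOC and link with -ldmalloc. The function copy_label is a hypothetical example, not part of dmalloc.

```c
#include <stdlib.h>
#include <string.h>

/* Include dmalloc.h after the system headers so it can wrap the
 * allocation calls. Without -DDMALLOC this file builds as plain C. */
#ifdef DMALLOC
#include "dmalloc.h"
#endif

/* Hypothetical helper for illustration: duplicates a string on the heap.
 * The caller owns the result and must free() it. */
char *copy_label(const char *text)
{
    size_t len = strlen(text);

    /* len + 1 leaves room for the terminator; allocating only len bytes
     * is the kind of off-by-one overrun dmalloc is meant to catch. */
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, text, len + 1);
    return copy;
}
```

Built normally, the code above behaves like ordinary malloc and free. Built with dmalloc enabled, the checks are configured at run time through the DMALLOC_OPTIONS environment variable, so the same binary can be run with different levels of checking.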
Anyway, I would rather not take this approach. Instead, I suggest you stress test your application on a test server under valgrind (or any equivalent tool) to make sure you're doing memory allocation right, and then run the application in production without any memory allocation checking to maximize speed. In the end, though, it depends on what your application does and what your needs are.