I have implemented a rather complex new Op in TensorFlow with a GPU CUDA kernel. This Op requires a lot of dynamic memory allocation for variables which are not tensors and are deallocated after the Op is done; more specifically, it involves using a hash table.
Right now I am using cudaMalloc() and cudaFree(), but I have noticed TensorFlow has its own type called Eigen::GPUDevice, which has the ability to allocate and deallocate memory on the GPU.
My questions:
- Is it best practice to use Eigen::GPUDevice to manage GPU memory?
- By using Eigen::GPUDevice instead of the CUDA API, am I "automatically" enabling multi-GPU support, since different GPUDevices can be passed to the Op?
- Should I extend this idea to the CPU kernel and see if there is a CPUDevice type which also manages the memory, instead of using plain C++ (i.e. auto var = new int[100]; delete[] var)?
There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory via
template<typename Device, typename Dtype>
class MyOp : public OpKernel {
 public:
  explicit MyOp(OpKernelConstruction* context) : OpKernel(context) {
    // ...
  }

  void Compute(OpKernelContext* ctx) override {
    Tensor tmp_var;  // allocate_temp fills a Tensor, not a Tensor*
    Tensor* output = nullptr;
    TensorShape some_shape, some_shape2;

    // temporarily use this space
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // allocate memory for the output tensor
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
    // ...
  }
};
- Whatever needs memory should be allocated by the TensorFlow context and not by custom cudaMalloc or new type[num] calls.
- The context should provide the information for the Allocator (see below).
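Since you mention a hash table: scratch data that is not naturally a tensor can still go through the context allocator. A minimal sketch, assuming a raw byte buffer of num_bytes (a hypothetical size you compute yourself) is enough backing storage:

// Sketch: request num_bytes of scratch space on the Op's device; the
// allocator frees it automatically when `scratch` goes out of scope.
Tensor scratch;
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_INT8, TensorShape({num_bytes}),
                                       &scratch));
void* raw = static_cast<void*>(scratch.flat<int8>().data());
// hand `raw` to the CUDA kernel that builds the hash table

Because allocate_temp goes through the device's Allocator, the same code yields GPU memory on a GPU placement and host memory on a CPU placement.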
Consider, for the sake of simplicity, just adding two matrices (full example). TensorFlow operations usually contain the following structure:
- Op description having REGISTER_OP, which is responsible for shape-checking and setting the output shape (example; a sketch follows after this list)
- OpKernel responsible for allocating memory, getting pointers to the inputs, and setup (see above or this)
- Functor for the implementation itself, like:
Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
// the functor does not need to care about memory allocation, as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, tmp_var, output);
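For the first item, a registration sketch for the add-two-matrices example could look like this (the op name "MyMatrixAdd" and the shape logic are assumptions, not from the full example):

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

// Sketch: declare inputs/outputs and set the output shape to the shape of
// the first input (assuming both matrices have identical shapes).
REGISTER_OP("MyMatrixAdd")
    .Input("a: T")
    .Input("b: T")
    .Output("sum: T")
    .Attr("T: {float, double}")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return ::tensorflow::Status::OK();
    });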
You are then just left with implementing
// gpu version
template <typename Dtype>
struct MyFunctor<GPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, ...);
};

// cpu version
template <typename Dtype>
struct MyFunctor<CPUDevice, Dtype> {
  void operator()(::tensorflow::OpKernelContext* ctx, ...);
};
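For the GPU specialization, the body typically pulls the Eigen device out of the context and launches the CUDA kernel on its stream, which also answers the multi-GPU question: the Op runs on whichever GPU TensorFlow placed it on. A sketch where MyKernel, the pointer parameters, and n are all assumptions:

// Sketch: fetch the device assigned to this Op and launch a hypothetical
// __global__ kernel MyKernel on its stream.
template <typename Dtype>
void MyFunctor<GPUDevice, Dtype>::operator()(
    ::tensorflow::OpKernelContext* ctx, const Dtype* in_a, const Dtype* in_b,
    Dtype* tmp, Dtype* out, int n) {
  const GPUDevice& d = ctx->eigen_device<GPUDevice>();
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  MyKernel<Dtype><<<blocks, threads, 0, d.stream()>>>(in_a, in_b, tmp, out, n);
}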
Edit:
- allocate_persistent: use this if you need your data between Op invocations, like one-time index structures. [example]
- allocate_temp: just temporary memory, which will not be retained past the end of the Compute method. [example]
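As an illustration of the first bullet, a minimal sketch in the TF 1.x API (class name, dtype, and shape are hypothetical):

// Sketch: allocate once in the constructor; the buffer survives across
// Compute() calls and lives on whatever device the kernel is placed on.
class MyStatefulOp : public OpKernel {
 public:
  explicit MyStatefulOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
    OP_REQUIRES_OK(ctx, ctx->allocate_persistent(
        DT_FLOAT, TensorShape({100}), &index_, nullptr));
  }
  void Compute(OpKernelContext* ctx) override {
    Tensor* index = index_.AccessTensor(ctx);  // typed view of the buffer
    // ... reuse `index` as a one-time index structure ...
  }
 private:
  PersistentTensor index_;
};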
But I highly recommend reading the comment in the source code here and then deciding depending on your use case.
The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device, it will allocate GPU memory for that particular device, and if it runs on a CPU device it will allocate CPU memory.
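To tie this back to the multi-device question: once the kernel is templated on the Device, you register it once per device, and TensorFlow picks the right Allocator and Eigen device per placement. A sketch reusing the hypothetical names from above:

#include "tensorflow/core/framework/op_kernel.h"

// Sketch: register the same templated kernel class for both devices.
REGISTER_KERNEL_BUILDER(
    Name("MyMatrixAdd").Device(::tensorflow::DEVICE_CPU).TypeConstraint<float>("T"),
    MyOp<CPUDevice, float>);
REGISTER_KERNEL_BUILDER(
    Name("MyMatrixAdd").Device(::tensorflow::DEVICE_GPU).TypeConstraint<float>("T"),
    MyOp<GPUDevice, float>);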
Source: https://stackoverflow.com/questions/48580580/tensorflow-new-op-cuda-kernel-memory-management