Why does higher need to deep copy the parameters of the base model to create a functional model?


Question


I found this line of code in the higher library:

self.param_groups = _copy.deepcopy(other.param_groups)

and I don't understand why that's needed.

If anything, I think it's harmful, as I've outlined here. You can go to the issue to see my reasoning, but the gist is this:

Wouldn't having that deep copy mean the (outer loop) optimizer would be computing gradients with respect to parameters not present in the computation graph? Since:

the parameters held by the differentiable/inner optimizer are a deep copy, while the outer optimizer (e.g. Adam) would hold the original/initial parameters, so the gradients of those original parameters should always be zero. That is the only explanation I can think of for my issues in the past (gradients being zero unexpectedly); however, the higher MAML tutorial seems to work, which would contradict my theory. If my theory were right, then at the end of the MAML inner loop, when the outer optimizer (usually Adam) computes the gradients, they should be zero (which I have observed sometimes). But I assume they are NOT zero, otherwise that tutorial wouldn't work.

So I am asking about the need for the deep copy when creating inner optimizers. What is its purpose, and why does it not cause the issues I describe in higher's original MAML tutorial? How is it that the deep copy doesn't break the forward pass, and thus the whole computation of the gradient w.r.t. the initialization that the outer optimizer would use?


At the core of my confusion is that I don't understand why we need the deepcopy in the first place. Without all the other code (which seems convoluted to me), we even risk that the initialization we want to train with the outer optimizer never trains, since the outer/meta optimizer holds a pointer to the params of the original model, not to the deep copy the inner optimizer could have had.

Why would the developers go through all that, adding substantial code that seems to carry such risks?
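For reference, here is a minimal sketch of the kind of inner/outer loop I have in mind, using higher's innerloop_ctx (the model, data, and hyperparameters are made up for illustration); my concern is whether the original parameters held by the outer optimizer receive non-zero gradients after the outer backward pass:

    import torch
    import torch.nn as nn
    import higher

    # Toy model and data, just for illustration.
    model = nn.Linear(4, 1)
    meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # outer optimizer holds the ORIGINAL params
    inner_opt = torch.optim.SGD(model.parameters(), lr=1e-1)  # reference inner optimizer

    x, y = torch.randn(8, 4), torch.randn(8, 1)

    meta_opt.zero_grad()
    with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False) as (fmodel, diffopt):
        for _ in range(3):                            # inner-loop adaptation
            inner_loss = ((fmodel(x) - y) ** 2).mean()
            diffopt.step(inner_loss)                  # differentiable update of fmodel's params
        outer_loss = ((fmodel(x) - y) ** 2).mean()
        outer_loss.backward()                         # should populate grads on model.parameters()

    # The question is whether these gradients are zero or not.
    print([p.grad.abs().sum().item() for p in model.parameters()])
    meta_opt.step()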


Related question on how the copying of the initial parameters happens in higher: What does the copy_initial_weights documentation mean in the higher library for Pytorch?


Answer 1:


Judging by the later code, the main reason for that line is to copy everything except the trainable weights. Unfortunately, that is difficult to achieve without also copying the weights, so a plain call to deepcopy is used.

If you trace how self.param_groups is used, you will find that the 'params' entry of each group is actually just replaced by None later, here.
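A conceptual sketch of the pattern being described (not higher's exact source): the deepcopy snapshots the hyperparameters, and the copied weight tensors are then thrown away.

    import copy
    import torch
    import torch.nn as nn

    # Hypothetical illustration of the pattern described above.
    other = torch.optim.SGD(nn.Linear(4, 1).parameters(), lr=0.1, momentum=0.9)

    param_groups = copy.deepcopy(other.param_groups)     # snapshots lr, momentum, etc. (weights get copied too, unavoidably)
    for group in param_groups:
        group['params'] = [None] * len(group['params'])  # copied weights are dropped; gradients never flow through them

    print(param_groups)  # hyperparameters survive, 'params' entries are None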

The initialization of the differentiable optimizer here needs to make copies of all the settings the reference optimizer other has (tensor and non-tensor ones alike, such as lr, as well as states such as momentum_buffer, though states are copied later, here). This effectively creates a snapshot of everything in other except for the trainable weights that other was supposed to accumulate gradients into. So overall the gradients don't propagate through these copies; they propagate through the initial weights of fmodel (if copy_initial_weights=False for that model) and/or through tensors requiring grad that were passed to the differentiable optimizer using override.
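As a minimal sketch of both gradient paths (toy model, data, and hyperparameters are assumptions, not from the original post): the initial weights of fmodel via copy_initial_weights=False, and a learnable learning rate passed through override.

    import torch
    import torch.nn as nn
    import higher

    model = nn.Linear(4, 1)
    inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # A learning rate that itself requires grad, handed to the differentiable optimizer via override.
    lr_tensor = torch.tensor(0.1, requires_grad=True)

    x, y = torch.randn(8, 4), torch.randn(8, 1)

    with higher.innerloop_ctx(model, inner_opt,
                              copy_initial_weights=False,
                              override={'lr': [lr_tensor]}) as (fmodel, diffopt):
        for _ in range(3):
            diffopt.step(((fmodel(x) - y) ** 2).mean())   # differentiable inner updates
        outer_loss = ((fmodel(x) - y) ** 2).mean()

    # Gradients reach both the original weights of `model` and the overridden lr.
    grads = torch.autograd.grad(outer_loss, list(model.parameters()) + [lr_tensor])
    print([g.abs().sum().item() for g in grads])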



Source: https://stackoverflow.com/questions/62437960/why-does-higher-need-to-deep-copy-the-parameters-of-the-base-model-to-create-a-f
