Merge weights of same model trained on 2 different computers using tensorflow

Submitted by 荒凉一梦 on 2019-12-12 16:18:42

Question


I was doing some research on training deep neural networks using TensorFlow. I know how to train a model. My problem is that I have to train the same model on 2 different computers with different datasets, then save the model weights. Later I have to merge the 2 model weight files somehow. I have no idea how to merge them. Is there a function that does this, or should the weights be averaged?

Any help on this problem would be useful

Thanks in advance


Answer 1:


It is better to merge weight updates (gradients) during training and keep a common set of weights, rather than trying to merge the weights after the individual trainings have completed. The two individually trained networks may each find a different optimum, and e.g. averaging the weights may give a network which performs worse on both datasets.

There are two things you can do:

  1. Look at 'data parallel training': distributing forward and backward passes of the training process over multiple compute nodes each of which has a subset of the entire data.

In this case typically:

  • each node propagates a minibatch forward through the network
  • each node propagates the loss gradient backwards through the network
  • a 'master node' collects gradients from minibatches on all nodes and updates the weights correspondingly
  • and distributes the weight updates back to the compute nodes to make sure each of them has the same set of weights

(There are variants of the above that avoid compute nodes idling too long while waiting for results from the others.) This assumes that the TensorFlow processes running on the compute nodes can communicate with each other during training.
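The master-node step above can be sketched in a few lines. This is a minimal illustration with numpy rather than a real distributed TensorFlow setup; the function name `master_update` and the toy numbers are hypothetical, purely to show the averaging-and-update logic:

```python
import numpy as np

def master_update(weights, worker_grads, lr=0.1):
    """Average the gradients collected from all workers and apply one SGD step.

    The returned weights would then be distributed back to every compute node,
    so all nodes continue training from the same parameters.
    """
    avg_grad = np.mean(worker_grads, axis=0)  # combine the minibatch gradients
    return weights - lr * avg_grad            # single SGD step on the shared weights

# Toy example: two workers, one shared weight vector.
w = np.array([1.0, 2.0])
grads = [np.array([0.2, 0.4]),  # gradient from worker 1's minibatch
         np.array([0.4, 0.6])]  # gradient from worker 2's minibatch
w = master_update(w, grads)     # -> array([0.97, 1.95])
```

In a real deployment this averaging is handled for you by TensorFlow's distributed runtime; the sketch only shows why synchronizing gradients keeps one meaningful set of weights, whereas averaging two fully trained weight files does not.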

Look at https://www.tensorflow.org/deploy/distributed for more details and an example of how to train networks over multiple nodes.


  2. If you really have to train the networks separately, look at ensembling, see e.g. this page: https://mlwave.com/kaggle-ensembling-guide/ . In a nutshell, you would train the individual networks on their own machines and then e.g. use an average or maximum over the outputs of both networks as a combined classifier / predictor.
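The ensembling idea can be sketched directly. This is a hypothetical minimal example (the function name `ensemble_predict` and the toy probabilities are made up): each separately trained model outputs class probabilities for the same set of classes, and the ensemble averages them before taking the argmax:

```python
import numpy as np

def ensemble_predict(probs_a, probs_b):
    """Average the class-probability outputs of two models, then pick the argmax class."""
    avg = (probs_a + probs_b) / 2.0
    return np.argmax(avg, axis=1)

# Toy example: 2 samples, 3 classes.
p1 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.5, 0.4]])  # predictions from the model trained on machine 1
p2 = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.2, 0.6]])  # predictions from the model trained on machine 2
ensemble_predict(p1, p2)          # -> array([0, 2])
```

Note that for the second sample the two models disagree (class 1 vs. class 2), and the averaged probabilities resolve the disagreement; this works on the models' outputs, without ever touching their weights.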



Answer 2:


There is no meaningful way to merge the weights: you cannot average or otherwise combine them, as the result would not correspond to anything either network learned. What you can do instead is combine the predictions, but for that the training classes have to be the same.

This is not a programming limitation but a theoretical one.



Source: https://stackoverflow.com/questions/48358874/merge-weights-of-same-model-trained-on-2-different-computers-using-tensorflow
