gradient-descent

What is `weight_decay` meta parameter in Caffe?

元气小坏坏 submitted on 2019-11-26 22:11:31
Question: Looking at an example 'solver.prototxt' posted on the BVLC/caffe git repository, there is a training meta parameter weight_decay: 0.04. What does this meta parameter mean? And what value should I assign to it?

Answer 1 (Shai): The weight_decay meta parameter governs the regularization term of the neural net. During training, a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation. As a rule of thumb, the more training examples you have, the weaker this term should be. The more …
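As a rough sketch of the idea (illustrative numpy, not Caffe's actual code): with L2 regularization the total loss is the data loss plus weight_decay * 0.5 * ||w||^2, so weight_decay simply scales an extra w term added to the gradient of every weight.

import numpy as np

def l2_regularized_loss_and_grad(data_loss, data_grad, w, weight_decay=0.04):
    # total loss = data loss + weight_decay * 0.5 * ||w||^2
    # total grad = data grad + weight_decay * w
    total_loss = data_loss + weight_decay * 0.5 * np.sum(w ** 2)
    total_grad = data_grad + weight_decay * w
    return total_loss, total_grad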

Why should weights of Neural Networks be initialized to random numbers?

梦想与她 submitted on 2019-11-26 21:18:11
Question: I am trying to build a neural network from scratch. Across the AI literature there is a consensus that weights should be initialized to random numbers in order for the network to converge faster. But why are a neural network's initial weights set to random numbers? I had read somewhere that this is done to "break the symmetry" and that this makes the neural network learn faster. How does breaking the symmetry make it learn faster? Wouldn't initializing the weights to 0 be a better idea? That way the weights would be able to find their values (whether positive or negative) faster? Is there some …
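A minimal numpy sketch (not from the original thread) of the symmetry problem: if every hidden unit starts from the same weights, every hidden unit also receives the same gradient, so they can never become different from one another.

import numpy as np

# Tiny one-hidden-layer net with squared loss and a constant (symmetric) init.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 samples, 3 inputs
y = rng.normal(size=(8, 1))

W1 = np.full((3, 4), 0.5)              # every hidden unit starts identical
W2 = np.full((4, 1), 0.5)

h = np.tanh(X @ W1)                    # forward pass
pred = h @ W2
err = pred - y

# Backward pass (plain chain rule).
dW2 = h.T @ err
dh = err @ W2.T
dW1 = X.T @ (dh * (1 - h ** 2))

# Every column of dW1 (one column per hidden unit) is identical, so after any
# number of updates the hidden units never differentiate from each other.
print(np.allclose(dW1[:, 0:1], dW1))   # True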

What is `lr_policy` in Caffe?

南楼画角 submitted on 2019-11-26 17:59:16
Question: I am just trying to find out how I can use Caffe. To do so, I took a look at the different .prototxt files in the examples folder. There is one option I don't understand:

# The learning rate policy
lr_policy: "inv"

Possible values seem to be: "fixed", "inv", "step", "multistep", "stepearly", "poly". Could somebody please explain those options?

Answer 1: If you look inside the /caffe-master/src/caffe/proto/caffe.proto file (you can find it online here) you will see the following descriptions: // The …
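To make a few of those policies concrete, here is a small Python sketch of how the learning rate evolves under them, using the formulas described in the caffe.proto comments; the parameter values are made up for illustration.

import math  # not strictly needed here, kept for clarity

def learning_rate(policy, it, base_lr=0.01, gamma=0.0001, power=0.75,
                  stepsize=5000, max_iter=10000):
    # Schedules as described in caffe.proto; `it` is the current iteration.
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (it // stepsize)
    if policy == "inv":
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "poly":
        return base_lr * (1 - it / max_iter) ** power
    raise ValueError("unknown policy: " + policy)

for it in (0, 2500, 5000, 10000):
    print(it, learning_rate("inv", it))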

Spark mllib predicting weird number or NaN

谁说胖子不能爱 submitted on 2019-11-26 17:48:30
Question: I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here's my code:

def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def …
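The answer is cut off above, so the following is only a sketch of the usual remedy when SGD-based regression returns NaN or huge values on unscaled data like this: standardize the features and use a small step size. It assumes an RDD of LabeledPoint named points (a hypothetical name, not from the thread).

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

labels = points.map(lambda p: p.label)
features = points.map(lambda p: p.features)

# Standardize features so gradient descent does not diverge on large raw values.
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = labels.zip(scaler.transform(features)).map(
    lambda lf: LabeledPoint(lf[0], lf[1]))

model = LinearRegressionWithSGD.train(scaled, iterations=100, step=0.01)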

Common causes of nans during training

喜欢而已 submitted on 2019-11-26 15:41:00
Question: I've noticed that a frequent occurrence during training is NaNs being introduced. Often it seems to be caused by weights in inner-product/fully-connected or convolution layers blowing up. Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data? The overarching question is simply: what is the most common reason for NaNs to occur during training? And secondly, what are some methods for combatting this (and why do …
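A small numpy sketch (not from the original thread) of two common sanity checks behind those causes: verify the input data itself is finite, and clip the gradient norm so a single exploding step cannot push the weights to NaN or Inf.

import numpy as np

def assert_finite(batch):
    # Catch NaN/Inf coming from the input data rather than from the network.
    if not np.all(np.isfinite(batch)):
        raise ValueError("input batch contains NaN or Inf")

def clip_gradient(grad, max_norm=10.0):
    # Rescale the gradient if its L2 norm exceeds max_norm (an illustrative value).
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad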

gradient descent using python and numpy

为君一笑 submitted on 2019-11-26 15:37:57
def gradient(X_norm, y, theta, alpha, m, n, num_it):
    temp = np.array(np.zeros_like(theta, float))
    for i in range(0, num_it):
        h = np.dot(X_norm, theta)
        #temp[j]=theta[j]-(alpha/m)*( np.sum( (h-y)*X_norm[:,j][np.newaxis,:] ) )
        temp[0] = theta[0] - (alpha / m) * (np.sum(h - y))
        temp[1] = theta[1] - (alpha / m) * (np.sum((h - y) * X_norm[:, 1]))
        theta = temp
    return theta

X_norm, mean, std = featureScale(X)
# length of X (number of rows)
m = len(X)
X_norm = np.array([np.ones(m), X_norm])
n, m = np.shape(X_norm)
num_it = 1500
alpha = 0.01
theta = np.zeros(n, float)[:, np.newaxis]
X_norm = X_norm.transpose()
theta = gradient(X_norm, y, theta, alpha, m, n, num_it)
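For comparison, here is a minimal, self-contained vectorized batch gradient descent for linear regression (a sketch, not the answer from the thread); the update for all parameters is done in one matrix expression instead of one line per theta[j], and the random data at the end is purely illustrative.

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    # X: (m, n) design matrix whose first column is all ones.
    # y: (m, 1) target vector.
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_iters):
        h = X @ theta                              # predictions, shape (m, 1)
        theta = theta - (alpha / m) * (X.T @ (h - y))
    return theta

# Hypothetical usage with synthetic data (true parameters roughly [3, 2]):
m = 100
X = np.column_stack([np.ones(m), np.random.rand(m, 1)])
y = 3 + 2 * X[:, 1:2] + 0.1 * np.random.randn(m, 1)
print(gradient_descent(X, y))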

How to interpret caffe log with debug_info?

我怕爱的太早我们不能终老 submitted on 2019-11-26 15:29:19
When facing difficulties during training (NaNs, loss does not converge, etc.) it is sometimes useful to look at a more verbose training log by setting debug_info: true in the 'solver.prototxt' file. The training log then looks something like:

I1109 ...] [Forward] Layer data, top blob data data: 0.343971
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
I1109 ...] [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...] [Forward] Layer conv2, top blob …
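To eyeball how those per-layer numbers evolve, the values can be pulled out of such a log with a small script; this is a sketch based only on the line format shown above, not an official Caffe tool, and "train.log" is a hypothetical file name.

import re

# Matches lines like:  I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
PATTERN = re.compile(r"\[Forward\] Layer (\S+), top blob (\S+) data: ([0-9.eE+-]+)")

def forward_blob_values(log_path):
    # Yield (layer, blob, reported data value) for each [Forward] top-blob line.
    with open(log_path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                yield m.group(1), m.group(2), float(m.group(3))

# Hypothetical usage:
# for layer, blob, value in forward_blob_values("train.log"):
#     print(layer, blob, value)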

Caffe: What can I do if only a small batch fits into memory?

半腔热情 submitted on 2019-11-26 09:53:33
Question: I am trying to train a very large model. Therefore, I can only fit a very small batch size into GPU memory. Working with small batch sizes results in very noisy gradient estimates. What can I do to avoid this problem?

Answer 1: You can change the iter_size in the solver parameters. Caffe accumulates gradients over iter_size x batch_size instances in each stochastic gradient descent step, so increasing iter_size gives a more stable gradient when you cannot use a large batch_size due to the …
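A rough numpy-style sketch of what gradient accumulation does (illustrative, not Caffe's code): gradients from iter_size small mini-batches are averaged and applied as a single update, so the effective batch size becomes iter_size x batch_size. The names batches and grad_fn are hypothetical.

import numpy as np

def accumulated_sgd_step(weights, batches, grad_fn, lr=0.01, iter_size=4):
    # grad_fn(weights, batch) is assumed to return the average gradient on that
    # mini-batch; averaging iter_size such gradients mimics one large batch.
    accum = np.zeros_like(weights)
    for batch in batches[:iter_size]:
        accum += grad_fn(weights, batch)
    accum /= iter_size
    return weights - lr * accum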
