gradient-descent

What is `weight_decay` meta parameter in Caffe?

元气小坏坏 submitted on 2019-11-26 22:11:31
Question: Looking at an example 'solver.prototxt' posted on the BVLC/caffe git repository, there is a training meta parameter weight_decay: 0.04. What does this meta parameter mean? And what value should I assign to it?

Answer 1 (Shai): The weight_decay meta parameter governs the regularization term of the neural net. During training, a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation. As a rule of thumb, the more training examples you have, the weaker this term should be. The more …
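As a rough sketch of the idea (illustrative numpy, not Caffe's actual code): with L2 regularization the total loss is the data loss plus weight_decay * 0.5 * ||w||^2, so weight_decay simply scales an extra w term added to the gradient of every weight.

import numpy as np

def l2_regularized_loss_and_grad(data_loss, data_grad, w, weight_decay=0.04):
    # total loss = data loss + weight_decay * 0.5 * ||w||^2
    # total grad = data grad + weight_decay * w
    total_loss = data_loss + weight_decay * 0.5 * np.sum(w ** 2)
    total_grad = data_grad + weight_decay * w
    return total_loss, total_grad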

Why should weights of Neural Networks be initialized to random numbers?

梦想与她 submitted on 2019-11-26 21:18:11
Question: I am trying to build a neural network from scratch. Across the AI literature there is a consensus that weights should be initialized to random numbers in order for the network to converge faster. But why are a neural network's initial weights set to random numbers? I had read somewhere that this is done to "break the symmetry" and that this makes the neural network learn faster. How does breaking the symmetry make it learn faster? Wouldn't initializing the weights to 0 be a better idea? That way the weights would be able to find their values (whether positive or negative) faster? Is there some …
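A minimal numpy sketch (not from the original thread) of the symmetry problem: if every hidden unit starts from the same weights, every hidden unit also receives the same gradient, so they can never become different from one another.

import numpy as np

# Tiny one-hidden-layer net with squared loss and a constant (symmetric) init.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 samples, 3 inputs
y = rng.normal(size=(8, 1))

W1 = np.full((3, 4), 0.5)              # every hidden unit starts identical
W2 = np.full((4, 1), 0.5)

h = np.tanh(X @ W1)                    # forward pass
pred = h @ W2
err = pred - y

# Backward pass (plain chain rule).
dW2 = h.T @ err
dh = err @ W2.T
dW1 = X.T @ (dh * (1 - h ** 2))

# Every column of dW1 (one column per hidden unit) is identical, so after any
# number of updates the hidden units never differentiate from each other.
print(np.allclose(dW1[:, 0:1], dW1))   # True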

What is `lr_policy` in Caffe?

南楼画角 submitted on 2019-11-26 17:59:16
Question: I am just trying to find out how I can use Caffe. To do so, I took a look at the different .prototxt files in the examples folder. There is one option I don't understand:

# The learning rate policy
lr_policy: "inv"

Possible values seem to be: "fixed", "inv", "step", "multistep", "stepearly", "poly". Could somebody please explain those options?

Answer 1: If you look inside the /caffe-master/src/caffe/proto/caffe.proto file (you can find it online here) you will see the following descriptions: // The …
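To make a few of those policies concrete, here is a small Python sketch of how the learning rate evolves under them, using the formulas described in the caffe.proto comments; the parameter values are made up for illustration.

import math  # not strictly needed here, kept for clarity

def learning_rate(policy, it, base_lr=0.01, gamma=0.0001, power=0.75,
                  stepsize=5000, max_iter=10000):
    # Schedules as described in caffe.proto; `it` is the current iteration.
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (it // stepsize)
    if policy == "inv":
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "poly":
        return base_lr * (1 - it / max_iter) ** power
    raise ValueError("unknown policy: " + policy)

for it in (0, 2500, 5000, 10000):
    print(it, learning_rate("inv", it))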

Spark mllib predicting weird number or NaN

谁说胖子不能爱 submitted on 2019-11-26 17:48:30
Question: I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here's my code:

def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def …
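The answer is cut off above, so the following is only a sketch of the usual remedy when SGD-based regression returns NaN or huge values on unscaled data like this: standardize the features and use a small step size. It assumes an RDD of LabeledPoint named points (a hypothetical name, not from the thread).

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

labels = points.map(lambda p: p.label)
features = points.map(lambda p: p.features)

# Standardize features so gradient descent does not diverge on large raw values.
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = labels.zip(scaler.transform(features)).map(
    lambda lf: LabeledPoint(lf[0], lf[1]))

model = LinearRegressionWithSGD.train(scaled, iterations=100, step=0.01)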

Common causes of nans during training

喜欢而已 submitted on 2019-11-26 15:41:00
Question: I've noticed that a frequent occurrence during training is NaNs being introduced. Often it seems to be caused by weights in inner-product/fully-connected or convolution layers blowing up. Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data? The overarching question is simply: what is the most common reason for NaNs to occur during training? And secondly, what are some methods for combatting this (and why do …
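A small numpy sketch (not from the original thread) of two common sanity checks behind those causes: verify the input data itself is finite, and clip the gradient norm so a single exploding step cannot push the weights to NaN or Inf.

import numpy as np

def assert_finite(batch):
    # Catch NaN/Inf coming from the input data rather than from the network.
    if not np.all(np.isfinite(batch)):
        raise ValueError("input batch contains NaN or Inf")

def clip_gradient(grad, max_norm=10.0):
    # Rescale the gradient if its L2 norm exceeds max_norm (an illustrative value).
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad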

gradient descent using python and numpy

为君一笑 submitted on 2019-11-26 15:37:57
def gradient(X_norm, y, theta, alpha, m, n, num_it):
    temp = np.array(np.zeros_like(theta, float))
    for i in range(0, num_it):
        h = np.dot(X_norm, theta)
        #temp[j]=theta[j]-(alpha/m)*( np.sum( (h-y)*X_norm[:,j][np.newaxis,:] ) )
        temp[0] = theta[0] - (alpha / m) * (np.sum(h - y))
        temp[1] = theta[1] - (alpha / m) * (np.sum((h - y) * X_norm[:, 1]))
        theta = temp
    return theta

X_norm, mean, std = featureScale(X)
# length of X (number of rows)
m = len(X)
X_norm = np.array([np.ones(m), X_norm])
n, m = np.shape(X_norm)
num_it = 1500
alpha = 0.01
theta = np.zeros(n, float)[:, np.newaxis]
X_norm = X_norm.transpose()
theta = gradient(X_norm, y, theta, alpha, m, n, num_it)
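For comparison, here is a minimal, self-contained vectorized batch gradient descent for linear regression (a sketch, not the answer from the thread); the update for all parameters is done in one matrix expression instead of one line per theta[j], and the random data at the end is purely illustrative.

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    # X: (m, n) design matrix whose first column is all ones.
    # y: (m, 1) target vector.
    m, n = X.shape
    theta = np.zeros((n, 1))
    for _ in range(num_iters):
        h = X @ theta                              # predictions, shape (m, 1)
        theta = theta - (alpha / m) * (X.T @ (h - y))
    return theta

# Hypothetical usage with synthetic data (true parameters roughly [3, 2]):
m = 100
X = np.column_stack([np.ones(m), np.random.rand(m, 1)])
y = 3 + 2 * X[:, 1:2] + 0.1 * np.random.randn(m, 1)
print(gradient_descent(X, y))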

How to interpret caffe log with debug_info?

我怕爱的太早我们不能终老 submitted on 2019-11-26 15:29:19
When facing difficulties during training (NaNs, loss does not converge, etc.) it is sometimes useful to look at a more verbose training log by setting debug_info: true in the 'solver.prototxt' file. The training log then looks something like:

I1109 ...] [Forward] Layer data, top blob data data: 0.343971
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
I1109 ...] [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...] [Forward] Layer conv2, top blob …
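To eyeball how those per-layer numbers evolve, the values can be pulled out of such a log with a small script; this is a sketch based only on the line format shown above, not an official Caffe tool, and "train.log" is a hypothetical file name.

import re

# Matches lines like:  I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
PATTERN = re.compile(r"\[Forward\] Layer (\S+), top blob (\S+) data: ([0-9.eE+-]+)")

def forward_blob_values(log_path):
    # Yield (layer, blob, reported data value) for each [Forward] top-blob line.
    with open(log_path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                yield m.group(1), m.group(2), float(m.group(3))

# Hypothetical usage:
# for layer, blob, value in forward_blob_values("train.log"):
#     print(layer, blob, value)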

Caffe: What can I do if only a small batch fits into memory?

半腔热情 submitted on 2019-11-26 09:53:33
Question: I am trying to train a very large model. Therefore, I can only fit a very small batch size into GPU memory. Working with small batch sizes results in very noisy gradient estimates. What can I do to avoid this problem?

Answer 1: You can change the iter_size in the solver parameters. Caffe accumulates gradients over iter_size x batch_size instances in each stochastic gradient descent step, so increasing iter_size gives a more stable gradient when you cannot use a large batch_size due to the …
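A rough numpy-style sketch of what gradient accumulation does (illustrative, not Caffe's code): gradients from iter_size small mini-batches are averaged and applied as a single update, so the effective batch size becomes iter_size x batch_size. The names batches and grad_fn are hypothetical.

import numpy as np

def accumulated_sgd_step(weights, batches, grad_fn, lr=0.01, iter_size=4):
    # grad_fn(weights, batch) is assumed to return the average gradient on that
    # mini-batch; averaging iter_size such gradients mimics one large batch.
    accum = np.zeros_like(weights)
    for batch in batches[:iter_size]:
        accum += grad_fn(weights, batch)
    accum /= iter_size
    return weights - lr * accum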
