gradient descent seems to fail

喜夏-厌秋 提交于 2019-11-28 16:06:48

I think that your computeCost function is wrong. I attended NG's class last year and I have the following implementation (vectorized):

m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;

J = 1/(2*m) * sum(sqrErrors);

The rest of the implementation seems fine to me, although you could also vectorize them.

theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));

Afterwards you are setting the temporary thetas (here called theta_1 and theta_2) correctly back to the "real" theta.

Generally it is more useful to vectorize instead of loops, it is less annoying to read and to debug.

If you are wondering how the seemingly complex looking for loop can be vectorized and cramped into a single one line expression, then please read on. The vectorized form is:

theta = theta - (alpha/m) * (X' * (X * theta - y))

Given below is a detailed explanation for how we arrive at this vectorized expression using gradient descent algorithm:

This is the gradient descent algorithm to fine tune the value of θ:

Assume that the following values of X, y and θ are given:

  • m = number of training examples
  • n = number of features + 1

Here

  • m = 5 (training examples)
  • n = 4 (features+1)
  • X = m x n matrix
  • y = m x 1 vector matrix
  • θ = n x 1 vector matrix
  • xi is the ith training example
  • xj is the jth feature in a given training example

Further,

  • h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
  • h(x)-y = ([X] * [θ] - [y]) (m x 1 matrix of Errors in our predictions)

whole objective of machine learning is to minimize Errors in predictions. Based on the above corollary, our Errors matrix is m x 1 vector matrix as follows:

To calculate new value of θj, we have to get a summation of all errors (m rows) multiplied by jth feature value of the training set X. That is, take all the values in E, individually multiply them with jth feature of the corresponding training example, and add them all together. This will help us in getting the new (and hopefully better) value of θj. Repeat this process for all j or the number of features. In matrix form, this can be written as:

This can be simplified as:

  • [E]' x [X] will give us a row vector matrix, since E' is 1 x m matrix and X is m x n matrix. But we are interested in getting a column matrix, hence we transpose the resultant matrix.

More succinctly, it can be written as:

Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as

This is the original expression we started out with:

theta = theta - (alpha/m) * (X' * (X * theta - y))

i vectorized the theta thing... may could help somebody

theta = theta - (alpha/m *  (X * theta-y)' * X)';
user2437742

If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.

Here is the normal equation: http://mathworld.wolfram.com/NormalEquation.html

And in octave form:

theta = (pinv(X' * X )) * X' * y

Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave

While not scalable like a vectorized version, a loop-based computation of a gradient descent should generate the same results. In the example above, the most probably case of the gradient descent failing to compute the correct theta is the value of alpha.

With a verified set of cost and gradient descent functions and a set of data similar with the one described in the question, theta ends up with NaN values just after a few iterations if alpha = 0.01. However, when set as alpha = 0.000001, the gradient descent works as expected, even after 100 iterations.

Using only vectors here is the compact implementation of LR with Gradient Descent in Mathematica:

Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[  
  Theta = Theta - 
  alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m; 
  Jhist[[k]] = 
  Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]

Note: Of course one assumes that X is a n * 2 matrix, with X[[,1]] containing only 1s'

This should work:-

theta(1,1) = theta(1,1) - (alpha*(1/m))*((X*theta - y)'* X(:,1) ); 

theta(2,1) = theta(2,1) - (alpha*(1/m))*((X*theta - y)'* X(:,2) ); 

its cleaner this way, and vectorized also

predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);

If you remember the first Pdf file for Gradient Descent form machine Learning course, you would take care of learning rate. Here is the note from the mentioned pdf.

Implementation Note: If your learning rate is too large, J(theta) can di- verge and blow up', resulting in values which are too large for computer calculations. In these situations, Octave/MATLAB will tend to return NaNs. NaN stands fornot a number' and is often caused by undened operations that involve - infinity and +infinity.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!