Question
I've actually been struggling with this for like 2 months now. What is it that makes these different?
hypotheses= X * theta
temp=(hypotheses-y)'
temp=X(:,1) * temp
temp=temp * (1 / m)
temp=temp * alpha
theta(1)=theta(1)-temp

hypotheses= X * theta
temp=(hypotheses-y)'
temp=temp * (1 / m)
temp=temp * alpha
theta(2)=theta(2)-temp

theta(1) = theta(1) - alpha * (1/m) * ((X * theta) - y)' * X(:, 1);
theta(2) = theta(2) - alpha * (1/m) * ((X * theta) - y)' * X(:, 2);
The latter works. I'm just not sure why. I struggle to understand the need for the matrix transpose (the ' operator).
Answer 1:
In the second block of your first example, you've missed out a step, haven't you? I'm assuming you concatenated X with a vector of ones.
temp=temp * X(:,2)
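Spelled out, here is a sketch of how the question's second block would read with that step included (keeping the question's variable names; temp is 1xm after the transpose, so it is multiplied by X(:,2) on the right to collapse to a scalar):

hypotheses= X * theta
temp=(hypotheses-y)'
temp=temp * X(:,2)     % the missing step: weight each error by the second feature
temp=temp * (1 / m)
temp=temp * alpha
theta(2)=theta(2)-temp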
The last example will work, but it can be vectorized even further to be simpler and more efficient.
I've assumed you only have 1 feature. It will work the same with multiple features, since all that happens is you add an extra column to your X matrix for each feature. Basically you add a vector of ones to X to vectorize the intercept.
You can update a 2x1 vector of thetas in one line of code. Concatenate a vector of ones with x, making it an nx2 matrix; then you can calculate h(x) by multiplying by the theta vector (2x1), which is the (X * theta) bit.
The second part of the vectorization is to transpose ((X * theta) - y), which gives you a 1xn matrix. When that is multiplied by X (an nx2 matrix), it basically aggregates both (h(x)-y)x0 and (h(x)-y)x1, so by definition both thetas are updated at the same time. This results in a 1x2 matrix of new thetas, which I just transpose again to flip the vector around to the same dimensions as the theta vector. I can then do a simple scalar multiplication by alpha and a vector subtraction with theta.
y = data(:, 2);
m = length(y);
X = [ones(m, 1), data(:, 1)];   % prepend a column of ones for the intercept term
theta = zeros(2, 1);
iterations = 2000;
alpha = 0.001;

for iter = 1:iterations
    theta = theta - ((1/m) * ((X * theta) - y)' * X)' * alpha;
end
Answer 2:
In the first one, if X were a 3x2 matrix and theta were a 2x1 matrix, then "hypotheses" would be a 3x1 matrix.
Assuming y is a 3x1 matrix, you can perform (hypotheses - y) and get a 3x1 matrix; the transpose of that 3x1 matrix, assigned to temp, is a 1x3 matrix.
Then the 1x3 matrix is assigned to theta(2), but theta(2) should be a scalar, not a matrix.
The last two lines of your code work because, using my mxn examples above,
(X * theta)
would be a 3x1 matrix.
Then that 3x1 matrix is subtracted by y (a 3x1 matrix) and the result is a 3x1 matrix.
(X * theta) - y
So the transpose of the 3x1 matrix is a 1x3 matrix.
((X * theta) - y)'
Finally, a 1x3 matrix times a 3x1 matrix will equal a scalar, or 1x1 matrix, which is what you are looking for. I'm sure you knew already, but just to be thorough, X(:,2) is the second column of the 3x2 matrix, making it a 3x1 matrix.
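A quick way to sanity-check these dimensions in Octave is with size(); here is a toy 3x2 example (the values are made up purely for illustration):

X = [1 1; 1 2; 1 3];                 % 3x2: a column of ones plus one feature
y = [1; 2; 3];                       % 3x1
theta = [0.5; 0.5];                  % 2x1

size(X * theta)                      % ans = 3 1  (hypotheses)
size((X * theta) - y)                % ans = 3 1
size(((X * theta) - y)')             % ans = 1 3
size(((X * theta) - y)' * X(:, 2))   % ans = 1 1, i.e. a scalar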
Answer 3:
When you update, you need to update both thetas simultaneously, like this (a concrete Octave version follows below):
Start Loop {
temp0 = theta0 - (equation_here);
temp1 = theta1 - (equation_here);
theta0 = temp0;
theta1 = temp1;
} End loop
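For concreteness, here is a minimal Octave sketch of that simultaneous update, with the standard linear-regression gradient filled in for (equation_here) and theta0/theta1 mapped to theta(1)/theta(2) as in the other answers:

temp0 = theta(1) - alpha * (1/m) * ((X * theta) - y)' * X(:, 1);
temp1 = theta(2) - alpha * (1/m) * ((X * theta) - y)' * X(:, 2);
theta(1) = temp0;   % assign only after both temps are computed,
theta(2) = temp1;   % so the second update still uses the old theta(1)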
Answer 4:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
% Performs gradient descent to learn theta. Updates theta by taking num_iters
% gradient steps with learning rate alpha.

m = length(y);  % number of training examples

% Save the cost J in every iteration in order to plot J vs. num_iters
% and check for convergence
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    h = X * theta;     % hypotheses, m x 1
    stderr = h - y;    % errors, m x 1
    theta = theta - (alpha/m) * (stderr' * X)';
    J_history(iter) = computeCost(X, y, theta);
end

end
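Several of these answers call a computeCost helper that isn't shown. Here is a minimal sketch of the standard squared-error cost for linear regression that it presumably computes (this body is an assumption, not the original poster's code):

function J = computeCost(X, y, theta)
% Squared-error cost: J = 1/(2m) * sum((X*theta - y).^2)
m = length(y);
J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);
end

A hypothetical call, assuming data.txt (a placeholder filename) holds one feature column and one target column:

data = load('data.txt');
y = data(:, 2);
X = [ones(length(y), 1), data(:, 1)];
[theta, J_history] = gradientDescent(X, y, zeros(2, 1), 0.01, 1500);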
Answer 5:
This can be vectorized more simply with
h = X * theta        % m x 1 vector of predictions (one per training example)
std_err = h - y      % m x 1 vector of errors (one per training example)
theta = theta - (alpha/m) * X' * std_err
Remember, X is the design matrix, and as such each row of X represents a training example and each column of X represents a given component (say the zeroth or first component) across all training examples. Each column of X is therefore exactly the thing we want to multiply element-wise with std_err before summing, to get the corresponding component of the theta vector.
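To see that, note that X' * std_err computes, for each column j of X, the sum over examples of std_err(i) * X(i, j). A toy check (values made up for illustration):

X = [1 4; 1 5; 1 6];
std_err = [0.1; -0.2; 0.3];
X' * std_err                                        % 2x1 vector of per-component sums
[sum(std_err .* X(:,1)); sum(std_err .* X(:,2))]    % same result, written element-wise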
Answer 6:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);  % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    hypothesis = X * theta;
    Error = hypothesis - y;
    temp = theta - ((alpha / m) * (Error' * X)');
    theta = temp;
    J_history(iter) = computeCost(X, y, theta);
end

end
Answer 7:
Spoiler alert
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCost) and gradient here.
    % ========================== BEGIN ===========================
    J = computeCost(X, y, theta);
    theta = theta - ((alpha * ((theta' * X') - y')) * X / m)';
    J1 = computeCost(X, y, theta);
    if (J1 > J)
        fprintf('Wrong alpha\n');   % cost increased: learning rate too large
        break;
    elseif (J1 == J)
        break;                      % cost unchanged: already converged
    end
    % ========================== END ==============================
    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);
end
end
Source: https://stackoverflow.com/questions/10591343/gradient-descent-implementation-in-octave