Question
Since this uses a sigmoid function instead of a zero/one step activation, I assume this is the right way to compute the output for gradient descent. Is that right?
  static double calculateOutput( int theta, double weights[], double[][] feature_matrix, int file_index, int globo_dict_size )
  {
     //double sum = x * weights[0] + y * weights[1] + z * weights[2] + weights[3];
     double sum = 0.0;
     for (int i = 0; i < globo_dict_size; i++) 
     {
         sum += ( weights[i] * feature_matrix[file_index][i] );
     }
     //bias
     sum += weights[ globo_dict_size ];
     return sigmoid(sum);
  }
  private static double sigmoid(double x)
  {
      return 1 / (1 + Math.exp(-x));
  }
In the following code I'm trying to update my Θ values (equivalent to the weights in a perceptron, aren't they?). In my related question I was given the formula LEARNING_RATE * localError * feature_matrix__train[p][i] * output_gradient[i] for that purpose. I commented out the weight update from my perceptron.
Is this new update rule the correct approach?
What is meant by output_gradient? Is that equivalent to the sum I calculate in my calculateOutput method?
      //LEARNING WEIGHTS
      // output must be a double: calculateOutput returns the sigmoid value in (0, 1)
      double localError, globalError, output;
      int p, iteration;
      iteration = 0;
      do 
      {
          iteration++;
          globalError = 0;
          //loop through all instances (complete one epoch)
          for (p = 0; p < number_of_files__train; p++) 
          {
              // calculate predicted class
              output = calculateOutput( theta, weights, feature_matrix__train, p, globo_dict_size );
              // difference between predicted and actual class values
              localError = outputs__train[p] - output;
              //update weights and bias
              for (int i = 0; i < globo_dict_size; i++) 
              {
                  //weights[i] += ( LEARNING_RATE * localError * feature_matrix__train[p][i] );
                  weights[i] += LEARNING_RATE * localError * feature_matrix__train[p][i] * output_gradient[i];
              }
              weights[ globo_dict_size ] += ( LEARNING_RATE * localError );
              //summation of squared error (error value for all instances)
              globalError += (localError*localError);
          }
          /* Root Mean Squared Error */
          if (iteration < 10) 
              System.out.println("Iteration 0" + iteration + " : RMSE = " + Math.sqrt( globalError/number_of_files__train ) );
          else
              System.out.println("Iteration " + iteration + " : RMSE = " + Math.sqrt( globalError/number_of_files__train ) );
          //System.out.println( Arrays.toString( weights ) );
      } 
      while(globalError != 0 && iteration<=MAX_ITER);
UPDATE: I've since revised the code, and it now looks more like this:
  double loss, cost, hypothesis, gradient;
  int p, iteration;
  iteration = 0;
  do 
  {
    iteration++;
    cost = 0.0;
    loss = 0.0;
    //loop through all instances (complete one epoch)
    for (p = 0; p < number_of_files__train; p++) 
    {
      // 1. Calculate the hypothesis h = X * theta
      hypothesis = calculateHypothesis( theta, feature_matrix__train, p, globo_dict_size );
      // 2. Calculate the loss = h - y and maybe the squared cost (loss^2)/2m
      loss = hypothesis - outputs__train[p];
      // 3. Calculate the gradient = X' * loss / m
      gradient = calculateGradent( theta, feature_matrix__train, p, globo_dict_size, loss );
      // 4. Update the parameters theta = theta - alpha * gradient
      for (int i = 0; i < globo_dict_size; i++) 
      {
          theta[i] = theta[i] - (LEARNING_RATE * gradient);
      }
      //summation of squared error, accumulated per instance inside the epoch loop
      cost += (loss*loss);
    }
  /* Root Mean Squared Error */
  if (iteration < 10) 
      System.out.println("Iteration 0" + iteration + " : RMSE = " + Math.sqrt( cost/number_of_files__train ) );
  else
      System.out.println("Iteration " + iteration + " : RMSE = " + Math.sqrt( cost/number_of_files__train ) );
  //System.out.println( Arrays.toString( weights ) );
  } 
  while(cost != 0 && iteration<=MAX_ITER);
}
static double calculateHypothesis( double theta[], double[][] feature_matrix, int file_index, int globo_dict_size )
{
    double hypothesis = 0.0;
     for (int i = 0; i < globo_dict_size; i++) 
     {
         hypothesis += ( theta[i] * feature_matrix[file_index][i] );
     }
     //bias
     hypothesis += theta[ globo_dict_size ];
     return hypothesis;
}
static double calculateGradent( double theta[], double[][] feature_matrix, int file_index, int globo_dict_size, double loss )
{
    double gradient = 0.0;
     for (int i = 0; i < globo_dict_size; i++) 
     {
         gradient += ( feature_matrix[file_index][i] * loss);
     }
     return gradient;
}
public static double hingeLoss()
{
    // l(y, f(x)) = max(0, 1 − y · f(x))
    return HINGE;
}
Answer 1:
Your calculateOutput method looks correct. The next piece of code, though, I don't think is right:
weights[i] += LEARNING_RATE * localError * feature_matrix__train[p][i] * output_gradient[i]
Look at the image you posted in your other question (not reproduced here), which shows the gradient descent update rules for Theta0 and Theta1:
Let's try to identify each part of these rules in your code.
- Theta0 and Theta1: these look like weights[i] in your code; I hope globo_dict_size = 2;
- alpha: seems to be your LEARNING_RATE;
- 1 / m: I can't find this anywhere in your update rule. m is the number of training instances in Andrew Ng's videos. In your case it should be 1 / number_of_files__train, I think. It's not very important though; things should work well even without it.
- The sum: you do this with the calculateOutput function, whose result you use in the localError variable, which you multiply by feature_matrix__train[p][i] (equivalent to x(i) in Andrew Ng's notation).
  This part is your partial derivative, and part of the gradient!
  Why? Because the partial derivative of [h_theta(x(i)) - y(i)]^2 with respect to Theta0 is equal to:
      2*[h_theta(x(i)) - y(i)] * derivative[h_theta(x(i)) - y(i)]
  where
      derivative[h_theta(x(i)) - y(i)]
        = derivative[Theta0 * x(i, 1) + Theta1 * x(i, 2) - y(i)]
        = x(i, 1)
  Of course, you should differentiate the entire sum. This is also why Andrew Ng used 1 / (2m) for the cost function, so that the 2 cancels out with the 2 we get from differentiation.
  Remember that x(i, 1), or just x(1), should consist of all ones. In your code, you should make sure that:
      feature_matrix__train[p][0] == 1
- That's it! I don't know what output_gradient[i] is supposed to be in your code; you don't define it anywhere.
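The pieces identified above (alpha, 1 / m, the sum, and the multiplication by x) can be put together in a rough sketch of one batch update for a linear hypothesis. The class and method names here (BatchGradientDescentSketch, hypothesis, step, fit) are made up for illustration and do not come from your code. The sketch assumes column 0 of every feature row is 1, so that theta[0] plays the role of Theta0. Note that the gradient has one component per parameter, not a single scalar.

```java
import java.util.Arrays;

// Hypothetical illustration class; none of these names appear in the question's code.
final class BatchGradientDescentSketch {

    // Linear hypothesis h_theta(x) = sum_j theta[j] * x[j]; x[0] is assumed to be 1.
    static double hypothesis(double[] theta, double[] x) {
        double sum = 0.0;
        for (int j = 0; j < theta.length; j++) {
            sum += theta[j] * x[j];
        }
        return sum;
    }

    // One batch update:
    // theta[j] := theta[j] - alpha * (1/m) * sum_i (h_theta(x_i) - y_i) * x_i[j]
    static double[] step(double[] theta, double[][] x, double[] y, double alpha) {
        int m = x.length;
        double[] gradient = new double[theta.length];
        for (int i = 0; i < m; i++) {
            double error = hypothesis(theta, x[i]) - y[i];
            for (int j = 0; j < theta.length; j++) {
                gradient[j] += error * x[i][j]; // one partial derivative per parameter
            }
        }
        double[] updated = new double[theta.length];
        for (int j = 0; j < theta.length; j++) {
            updated[j] = theta[j] - alpha * gradient[j] / m;
        }
        return updated;
    }

    // Repeats the batch update a fixed number of times, starting from all zeros.
    static double[] fit(double[][] x, double[] y, double alpha, int iterations) {
        double[] theta = new double[x[0].length];
        for (int k = 0; k < iterations; k++) {
            theta = step(theta, x, y, alpha);
        }
        return theta;
    }

    public static void main(String[] args) {
        double[][] x = { {1, 0}, {1, 1}, {1, 2}, {1, 3} }; // first column is all ones
        double[] y = { 1, 3, 5, 7 };                       // generated from y = 1 + 2x
        System.out.println(Arrays.toString(fit(x, y, 0.1, 5000))); // should approach [1.0, 2.0]
    }
}
```

All parameters are updated from the same gradient vector, computed before any theta[j] changes, which is the simultaneous update that the rules call for.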
I suggest you take a look at this tutorial to get a better understanding of the algorithm you have used. Since you use the sigmoid function, it seems like you want to do classification, but then you should use a different cost function. That document deals with logistic regression as well.
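For the classification case, the cost usually paired with a sigmoid hypothesis is the cross-entropy (log loss). Its partial derivative with respect to theta[j] works out to (h - y) * x[j], so the update keeps the same shape as the linear rule and needs no extra gradient factor. If you instead keep the squared error with a sigmoid output, the chain rule adds a factor h * (1 - h), which may be what output_gradient was meant to hold, though that is only a guess. Below is a minimal sketch with per-instance updates, storing the bias in the last slot of theta as your calculateOutput does. The class and method names are hypothetical, and the toy data is invented for the example.

```java
// Hypothetical illustration class; names and data are assumptions, not from the question.
final class LogisticRegressionSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // h_theta(x) = sigmoid(theta . x + bias), with the bias stored at theta[x.length].
    static double predict(double[] theta, double[] x) {
        double z = theta[x.length];
        for (int j = 0; j < x.length; j++) {
            z += theta[j] * x[j];
        }
        return sigmoid(z);
    }

    // Cross-entropy cost averaged over all instances; labels y[i] are 0 or 1.
    static double cost(double[] theta, double[][] x, int[] y) {
        double total = 0.0;
        for (int i = 0; i < x.length; i++) {
            double h = predict(theta, x[i]);
            total -= y[i] * Math.log(h) + (1 - y[i]) * Math.log(1 - h);
        }
        return total / x.length;
    }

    // One epoch of per-instance updates: theta[j] -= alpha * (h - y) * x[j].
    static void epoch(double[] theta, double[][] x, int[] y, double alpha) {
        for (int i = 0; i < x.length; i++) {
            double error = predict(theta, x[i]) - y[i];
            for (int j = 0; j < x[i].length; j++) {
                theta[j] -= alpha * error * x[i][j];
            }
            theta[x[i].length] -= alpha * error; // bias term, its input is implicitly 1
        }
    }

    public static void main(String[] args) {
        double[][] x = { {0}, {1}, {2}, {3} };
        int[] y = { 0, 0, 1, 1 };
        double[] theta = new double[2]; // one weight plus the bias
        System.out.println("cost before: " + cost(theta, x, y));
        for (int k = 0; k < 1000; k++) {
            epoch(theta, x, y, 0.1);
        }
        System.out.println("cost after:  " + cost(theta, x, y));
    }
}
```

Tracking the cross-entropy rather than the RMSE gives a stopping signal that matches what the updates are actually minimising. A loop condition such as cost != 0 will rarely trigger with a sigmoid, because the output never reaches exactly 0 or 1.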
Source: https://stackoverflow.com/questions/28923292/calculate-gradient-output-for-theta-update-rule