梯度下降法

梯度下降法

梯度下降法，是一种基于搜索的最优化方法，最用是最小化一个损失函数。

一、什么是梯度下降？

机器学习算法都需要最大化或最小化一个函数，这个函数被称为"目标函数"，其中我们一般把最小化的一类函数，称为"损失函数"。它能根据预测结果，衡量出模型预测能力的好坏。在求损失函数最小化的过程中使用梯度下降法。

$\frac{dJ(\theta)}{d\theta}$

在直线方程中，倒数代表斜率，在曲线方程中，倒数代表切线的斜率。倒数代表着参数theta单位变化时，损失函数J相应的的变化。通过上面图中的点可以发现，改点的导数为负值，所以随着参数theta的增加，损失函数J减小，因此导数从某种意义上还可以代表方向，对应着损失函数J增大的方向。

$-\eta\frac{dJ(\theta)}{d\theta}$

综上，如果最小化一个函数，我们就需要得到导数再取个负数，并且再乘以一个系数，这个系数通常叫做步长或者叫学习率(Learning rate, Lr)。\eta的取值影响获得求最优解的速度，取值不合适的话甚至得不到最优解，它是梯度下降的一个超参数。\eta太小，减慢收敛速度效率，\eta太大，甚至会导致不收敛。

对于梯度下降算法而言，最不友好的就是并不是所有的函数都有唯一的极值点。很大概率就是局部最优解，并不是真正的全局最优解。这还只是针对二维平面来说，如果对于高维空间更加相对复杂的环境，就更不好说了。

解决方案：多次运行，随机化初始点。初始点也是梯度下降算法的一个超参数。初始化的方法也有很多。

二、简单线性回归中使用梯度下降法

在线性回归算法中目标函数J(\theta)=\sum_{i=1}^m(y^{(i)}-\hat{y}^{(i)})^2,线性回归算法具有唯一的最优解。下面就来实现一下：

首先，在Jupyter Notebook先生成一组数据。

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 6, 141)
y = (x - 2.5) ** 2 - 1
plt.plot(x ,y)
plt.show()

def dj(theta):
    return 2 * (theta - 2.5)
def J(theta):
    return (theta-2.5) ** 2 - 1

theta = 0.0
epsilon = 1e-8
eta = 0.1
while True:
    gradient = dj(theta)
    last_theta = theta
    theta = theta - eta * gradient
    
    if (abs(J(theta) - J(last_theta)) < epsilon):
        break
print(theta)
print(J(theta))

输出结果：

2.499891109642585
-0.99999998814289

从输出结果来看，\theta=2.499891109642585，计算机的浮点数计算精度来说，所以达不到2.5。所以才设置epsilon=1e-8，也是因为这个原因。接下来，看一下梯度下降的过程：

theta = 0.0
epsilon = 1e-8
eta = 0.1
history_theta = [theta]
while True:
    gradient = dj(theta)
    last_theta = theta
    history_theta.append(theta)
    theta = theta - eta * gradient
    
    if (abs(J(theta) - J(last_theta)) < epsilon):
        break
plt.plot(x, J(x))
plt.plot(np.array(history_theta), J(np.array(history_theta)), 'r', marker='+')
plt.show()

从上面的图中看出，刚开始下降很快，是因为刚开始的梯度大，所以下降比较快。再看看走了多少步？

len(history_theta)

输出结果：46

下面，把梯度下降封装成一个函数：

def dj(theta):
    return 2 * (theta - 2.5)

def J(theta):
    return (theta-2.5) ** 2 - 1

def gradient_descent(initial_theta, eta, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)

    while True:
        gradient = dj(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if (abs(J(theta) - J(last_theta)) < epsilon):
            break
            
def plot_theta_history():
    plt.plot(x, J(x))
    plt.plot(np.array(theta_history), J(np.array(theta_history)), 'r', marker='+')
    plt.show()

测试：

eta = 0.01
theta_history = []
gradient_descent(0., eta)
plot_theta_history()
len(theta_history)

这是eta=0.01的情况，一共有3682个点，有兴趣还可以试试0.001，那这是减小学习率的过程，如果增大学习率试试，eta=0.8，

eta = 0.8
theta_history = []
gradient_descent(0., eta)
plot_theta_history()
len(theta_history)

可以看出虽然第一步跳到的右边，但是总体还是在朝着减小的地方移动，最终也能寻找到最优解。这说明我们不一定非要沿着一边一点一点的减少，只要在一个合理的范围内就可以。但是如果把eta设置为1.1就会报错。但是如果我们修改一下J这个函数，就不会报错，但是会陷入一个死循环，因此我们还要加一个循环次数限制，强行让程序停止。

def J(theta):
    try:
        return (theta-2.5) ** 2 - 1
    except:
        return float('inf')

def gradient_descent(initial_theta, eta, n_iters= 1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)

    while i_iter < n_iters:
        gradient = dj(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if (abs(J(theta) - J(last_theta)) < epsilon):
            break
        i_iter += 1

eta = 1.1
theta_history = []
gradient_descent(0., eta, n_iters=10)
plot_theta_history()

1565100109319

eta = 1.1
theta_history = []
gradient_descent(0., eta)
theta_history[-1]

输出结果：nan。这是因为在python中无穷大-无穷大=NaN

三、多元线性回归算法中使用梯度下降法

由于目标函数的梯度随着特征向量的增加会越来越大，这是不合理的，因此我们给它加个系数1/m。

如果想把2消掉，J（θ）=MSE（y-y_hat）

首先，先随机生成一组数据：

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3. + 4. + np.random.normal(size=100)
X = x.reshape(-1, 1)
X.shape
y.shape
plt.scatter(x, y)
plt.show()

def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
    except:
        return float('inf')

def dj(theta, X_b, y):
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
        
    return res * 2 / len(X_b)

def gradient_descent(X_b, y, init_theta, eta, n_iters=1e4, epsilon=1e-8):
    
    theta = init_theta
    i_iter = 0
    
    while i_iter < n_iters:
        gradient = dj(theta, X_b, y)
        last_theta = theta
        theta = last_theta - eta * gradient
        
        if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
            break
        
        i_iter += 1
        
    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
init_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, init_theta, eta)
theta

输出结果：array([4.02145786, 3.00706277])

这跟生成数据时候的截距为4，系数为3。所以这样的预测结果也在预料之中。

接下来将这个函数封装到我之前写的多元线性回归的类中。之前的多元线性回归中我们使用的是正规方程解fit_normal，现在我们将梯度下降方法加入这个类中fit_gd。

import numpy as np
from play_ML.multi_linear_regression.metrics import r2_score

class LinearRegression(object):

    def __init__(self):
        "初始化多元线性回归模型"
        self.coef_ = None
        self.interception_ = None
        self._theta = None

    def fit_normal(self, x_train, y_train):
        assert x_train.shape[0] == y_train.shape[0], \
        "the size of x_train must be equal to the size of y_train"

        X = np.hstack([np.ones((len(x_train), 1)), x_train])
        self._theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def fit_gd(self, x_train, y_train, eta=0.01):
        assert x_train.shape[0] == y_train.shape[0], \
        "the size of x_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
            except:
                return float('inf')

        def dj(theta, X_b, y):
            res = np.empty(len(theta))
            res[0] = np.sum(X_b.dot(theta) - y)
            for i in range(1, len(theta)):
                res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            return res * 2 / len(X_b)

        def gradient_descent(X_b, y, init_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = init_theta
            i_iter = 0

            while i_iter < n_iters:
                gradient = dj(theta, X_b, y)
                last_theta = theta
                theta = last_theta - eta * gradient

                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                i_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        init_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, init_theta, eta, n_iters=1e4)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, \
        "must fit before predict"
        assert x_predict.shape[1] == len(self.coef_), \
        "the feature number must be equal to x_train"

        X = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return X.dot(self._theta)

    def score(self, x_test, y_test):
        y_preict = self.predict(x_test)
        return r2_score(y_test, y_preict)

    def __repr__(self):
        return "Multi_Linear_Regression"

四、梯度下降算法的向量化

所以，我们需要将求梯度的函数进行改造一下：之前一直都是使用随机生成的数据，接下来使用波士顿房价进行多元线性回归，并且使用梯度下降法的向量化，

def dj(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    boston = datasets.load_boston()
    x = boston.data
    y = boston.target

    X = x[y < 50]
    Y = y[y < 50]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
    reg1 = LinearRegression()
    reg1.fit_gd(x_train, y_train)

报错：RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs)，数据溢出。之前我们就说过如果求导得到的梯度需要乘以一个系数，不然会数据溢出，那么在这里为什么又overflow呢？这是因为在我们的数据中肯定有偏向两端的极值存在，比如1000个数据中900个是1000，另外100个是1，就算我们除以1000仍然会可能会导致数据溢出。这也就是之前为什么会谈到数据归一化的问题。

if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    boston = datasets.load_boston()
    x = boston.data
    y = boston.target

    X = x[y < 50]
    Y = y[y < 50]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
    from sklearn.preprocessing import StandardScaler

    standard = StandardScaler()
    standard.fit(x_train)
    x_train_standard = standard.transform(x_train)
    reg1 = LinearRegression()
    reg1.fit_gd(x_train_standard, y_train)

    x_test_standard = standard.transform(x_test)
    print(reg1.score(x_test_standard, y_test))

因此，在使用梯度下降法前，最好进行数据归一化。

那使用梯度下降法向量化有什么优势呢？

import numpy as np
n = 20000
m = 1000
big_x = np.random.normal(size=(m, n))
true_theta = np.random.uniform(0., 100., size=n+1)
big_y = big_x.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0., 10., size=m)

%run D:/goodgoodstudy/play_ML/multi_linear_regression/linear_regression.py
big_reg1 = LinearRegression()
%time big_reg1.fit_normal(big_x, big_y) 
# 输出结果：Wall time: 1min 41s
big_reg2 = LinearRegression()
%time big_reg2.fit_gd(big_x, big_y)
# 输出结果：Wall time: 1.45 s

通过比较，梯度下降法向量化明显快很多。

五、梯度下降法

1、批量梯度下降法

批量梯度下降法，（Batch Gradient Descent），通过下面这个公式可以看出，如果想要求出梯度，每一项都要对所有的样本进行一次计算。这种计算方式一定能精确地求出最优梯度。但如果样本量m比较大的时候，计算梯度那是相当耗时的。因此，基于这个问题就有了随机梯度下降。

1565251301049

2、随机梯度下降

随机梯度下降，（Stochastic Gradient Descent），根据批量梯度下降算法，能不能每次只对一个样本进行求梯度，这样也就不用除以m了。基于这样的想法就出现了随机选择样本求梯度的方法，并且使其向量化，使用下面这个式子作为搜索的方向。这个式子已经不是目标函数的梯度方向了。

既然不确定这个搜索方向是不是梯度方向，那么学习率eta这个参数就更加重要了，如果一直取个不变的学习率，很有可能到达最优解之后还会跳出去。因此，在实际的使用过程中，在随机梯度下降法中需要让学习率逐渐递减。

这种思想其实在基于搜索领域的一种叫做模拟退火的思想。优势在于能够跳出局部最优解，拥有更快的计算速度，机器学习领域很多地方都会用到随机的特点，比如随机森林等。

def dj_sgd(theta, X_b_i, y_i):
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.
def sgd(X_b, y, init_theta, n_iters):
    t0 = 5
    t1 = 50
    
    def learning_rate(t):
        return t0 / (t + t1)
    
    theta = init_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))
        gradient = dj_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
        
    return theta

接下来，具体地测试一下：

%%time
X_b = np.hstack([np.ones((len(X),1)), X])
init_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, init_theta, n_iters=len(X_b) // 3)

输出结果：Wall time: 17.6 ms

theta

输出结果：Wall time: array([4.12778345, 3.16315746])

通过测试代码，可以看出只需要使用1/3的数据量求得的搜索方向也能找到最优解。当然在高维数据中，就不能这么写意地只是用1/3的样本。接下来将随机梯度下降封装一下，并做一些改进：

1.n_iters表示需要随机地将样本过几遍。

2.每次我们都将样本进行随机乱序。

def fit_sgd(self, x_train, y_train, n_iters=5, t0=5, t1=50):
        assert x_train.shape[0] == y_train.shape[0], \
        "the size of x_train must be equal to the size of y_train"
        assert n_iters >= 1, "must > 1"
        def dj_sgd(theta, X_b_i, y_i):
            return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

        def sgd(X_b, y, init_theta):

            def learning_rate(t):
                return t0 / (t + t1)

            theta = init_theta
            m = len(X_b)
            for cur_iter in range(n_iters):
                indexes = np.random.permutation(m)
                X_b_new = X_b[indexes]
                y_new = y[indexes]
                for i in range(m):
                    gradient = dj_sgd(theta, X_b[i], y[i])
                    theta = theta - learning_rate(cur_iter*m + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        init_theta = np.zeros(X_b.shape[1])
        self._theta = sgd(X_b, y_train, init_theta, n_iters, t0, t1)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

import numpy as np
import matplotlib.pyplot as plt
m = 10000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)
y = x * 3. + 4. + np.random.normal(0, 3, size=m) 

%run D:/goodgoodstudy/play_ML/multi_linear_regression/linear_regression.py
reg3 = LinearRegression()
reg3.fit_sgd(X, y, n_iters=2)
reg3.interception_
reg3.coef_

使用真实数据集验证随机梯度下降，

if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    boston = datasets.load_boston()
    x = boston.data
    y = boston.target

    X = x[y < 50]
    Y = y[y < 50]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
     from sklearn.preprocessing import StandardScaler

    standard = StandardScaler()
    standard.fit(x_train)
    x_train_standard = standard.transform(x_train)
    reg1 = LinearRegression()
    reg1.fit_sgd(x_train_standard, y_train, n_iters=10)
    x_test_standard = standard.transform(x_test)
    print(reg1.score(x_test_standard, y_test))

输出结果：当n_iters=5时，0.765475206668553

当n_iters=10时，0.8021056414726753

因此，我们可以调整n_iters使得准确率越来越高。

3、mini-batch Gradient Descent

小批量梯度下降法，mini-batch Gradient Descent，就是将批量梯度下降和随机梯度下降的两种结合，在深度学习的图像领域，会经常使用这种方法，因为相对一张图片来说假如是100*100=10000，因此如果直接使用批量梯度下降显然是不太现实的，所以结合随机梯度下降的思想，每次从样本中随机抽取一个mini-batch进行梯度求解，从而寻找搜索方向，进而找到最优解。

4、scikit-learn中封装实现sgd

在sklearn中还对sgd的实现进行了优化，具体的优化过程可以去github上看源码。

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

boston = datasets.load_boston()
x = boston.data
y = boston.target

X = x[y < 50]
Y = y[y < 50]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=666)
from sklearn.preprocessing import StandardScaler

standard = StandardScaler()
standard.fit(x_train)
x_train_standard = standard.transform(x_train)
x_test_standard = standard.transform(x_test)

from sklearn.linear_model import SGDRegressor
reg2 = SGDRegressor()       # 可以传入参数n_iter
reg2.fit(x_train_standard, y_train)
print(reg2.score(x_test_standard, y_test))

六、关于梯度的调试

有时候遇到复杂的求导函数，很有可能求解梯度并不容易，在这种情况下，推导出公式以后，通过编程实现，但真正运行的时候，程序并不报错，但求得的梯度是错误的。那如何发现这种问题的原因所在呢？

上面这种图的一种思想就是通过在这个点的左边和右边各找一个临近点，连成的直线的斜率近似地作为改点的斜率，也就是我们所说的导数，其实这也就是导数的定义中的思想。并且将其扩展到高维空间也是一样的。

很显然，这种导数的求法从数学的意义上解释是很简单的，但是这样做带来的问题就是时间复杂度将会增加很多，因为需要在一个点进行2次求导，因此这种方法一般只做为调试的一种手段。

def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
    except:
        return float('inf')
    
def dj(theta, X_b, y):
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
        
    return res * 2 / len(X_b)

def dJ_math(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
        
    return res

def gradient_descent(dJ, X_b, y, init_theta, eta, n_iters=1e4, epsilon=1e-8):
    
    theta = init_theta
    i_iter = 0
    
    while i_iter < n_iters:
        gradient = dj(theta, X_b, y)
        last_theta = theta
        theta = last_theta - eta * gradient
        
        if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
            break
        
        i_iter += 1
        
    return theta

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(666)
X = np.random.random(size=(1000, 10))
true_theta = np.arange(1, 12, dtype=float)
X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(true_theta) + np.random.random(size=1000)
X_b = np.hstack([np.ones((len(X), 1)), X])
init_theta = np.zeros(X_b.shape[1])

eta = 0.001
gradient_descent(dJ_debug, X_b, y, init_theta, eta)

输出结果：

array([6.4206869 , 2.08303468, 2.56076208, 3.15680035, 4.41844686,
       5.2150793 , 5.80241598, 6.80736408, 7.5672971 , 8.47769298,
       9.31966978])

theta = gradient_descent(dJ_math, X_b, y, init_theta, eta)
theta

输出结果：

array([6.4206869 , 2.08303468, 2.56076208, 3.15680035, 4.41844686,
       5.2150793 , 5.80241598, 6.80736408, 7.5672971 , 8.47769298,
       9.31966978])

我是尾巴

梯度下降是迭代法的一种,可以用于求解最小二乘问题(线性和非线性都可以)。在求解机器学习算法的模型参数，即无约束优化问题时，梯度下降（Gradient Descent）是最常采用的方法之一，另一种常用的方法是最小二乘法。在求解损失函数的最小值时，可以通过梯度下降法来一步步的迭代求解，得到最小化的损失函数和模型参数值。

本次推荐：图片在线压缩工具

据说好多人早上看时间不是为了起床，而是看还能睡多久。我也一样！

继续加油~

来源：https://www.cnblogs.com/zhangkanghui/p/11322932.html

标签

梯度下降