译|Gradient Descent in Python

译|Gradient Descent in Python

当你初次涉足机器学习时,你学习的第一个基本算法就是 梯度下降 (Gradient Descent), 可以说梯度下降法是机器学习算法的支柱。 在这篇文章中,我尝试使用 python$python$ 解释梯度下降法的基本原理。一旦掌握了梯度下降法,很多问题就会变得容易理解,并且利于理解不同的算法。

1
2
3
4
import numpy as np
import matplotlib.pyplot as plt
plt.style.use(['ggplot'])
%matplotlib inline

如果你想尝试自己实现梯度下降法, 你需要加载基本的 p$python$ $packages$ —— $numpy$ and $matplotlib$

首先, 我们将创建包含着噪声的线性数据

1
2
3
# 随机创建一些噪声
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

接下来通过 matplotlib 可视化数据

1
2
3
4
5
# 可视化数据
plt.plot(X, y, 'b.')
plt.xlabel("$x$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])

|

显然, y$y$ 与 x$x$ 具有良好的线性关系,这个数据非常简单,只有一个自变量 x$x$.

我们可以将其表示为简单的线性关系:

y=b+mx

$$y=b+mx$$

并求出 b$b$ , m$m$。

这种被称为解方程的分析方法并没有什么不妥,但机器学习是涉及矩阵计算的,因此我们使用矩阵法(向量法)进行分析。

我们将 y$y$ 替换成 J(θ)$J(θ)$, b$b$ 替换成 θ0$θ0$, m$m$ 替换成 θ1$θ1$。
得到如下表达式:

J(θ)=θ0+θ1x

$$J(θ)=θ0+θ1x$$

注意: 本例中 θ0=4$θ0=4$, θ1=3$θ1=3$

求解 θ0$θ0$ 和 θ1$θ1$ 的分析方法,代码如下:

|

1
2
3
4
1  
2
3

|

1
2
3
4
X\_b = np.c\_\[np.ones((100, 1)), X\] # 为X添加了一个偏置单位,对于X中的每个向量都是1  
theta\_best = np.linalg.inv(X\_b.T.dot(X\_b)).dot(X\_b.T).dot(y)
theta\_best

|

1
2
array([[3.86687149],
[3.12408839]])

不难发现这个值接近真实的 θ0$θ0$,θ1$θ1$,由于我在数据中引入了噪声,所以存在误差。

|

1
2
3
4
5
1  
2
3
4

|

1
2
3
4
5
X\_new = np.array(\[\[0\], \[2\]\])  
X\_new\_b = np.c\_\[np.ones((2, 1)), X\_new\]
y\_predict = X\_new\_b.dot(theta\_best)
y\_predict

|

1
2
array([[ 3.86687149],
[10.11504826]])

梯度下降法 (Gradient Descent)

Cost Function & Gradients

计算代价函数和梯度的公式如下所示。

注意:代价函数用于线性回归,对于其他算法,代价函数是不同的,梯度必须从代价函数中推导出来。

Cost

J(θ)=1/2mm∑i=1(h(θ)(i)−y(i))2

$$J(θ)=1/2m∑i=1m(h(θ)(i)−y(i))2$$

Gradient

∂J(θ)∂θj=1/mm∑i=1(h(θ(i)−y(i)).X(i)j

$$∂J(θ)∂θj=1/m∑i=1m(h(θ(i)−y(i)).Xj(i)$$

Gradients

θ0:=θ0−α.(1/m.m∑i=1(h(θ(i)−y(i)).X(i)0)

$$θ0:=θ0−α.(1/m.∑i=1m(h(θ(i)−y(i)).X0(i))$$

θ1:=θ1−α.(1/m.m∑i=1(h(θ(i)−y(i)).X(i)1)

$$θ1:=θ1−α.(1/m.∑i=1m(h(θ(i)−y(i)).X1(i))$$

θ2:=θ2−α.(1/m.m∑i=1(h(θ(i)−y(i)).X(i)2)

$$θ2:=θ2−α.(1/m.∑i=1m(h(θ(i)−y(i)).X2(i))$$

θj:=θj−α.(1/m.m∑i=1(h(θ(i)−y(i)).X(i)0)

$$θj:=θj−α.(1/m.∑i=1m(h(θ(i)−y(i)).X0(i))$$

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
1  
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def cal\_cost(theta, X, y):  
'''
Calculates the cost for given X and Y. The following shows and example of a single dimensional X
theta = Vector of thetas
X = Row of X's np.zeros((2,j))
y = Actual y's np.zeros((2,1))

where:
j is the no of features
'''

m = len(y)

predictions = X.dot(theta)
cost = (1/2\*m) \* np.sum(np.square(predictions - y))

return cost

|
|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1  
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def gradient\_descent(X, y, theta, learning\_rate = 0.01, iterations = 100):  
'''
X = Matrix of X with added bias units
y = Vector of Y
theta=Vector of thetas np.random.randn(j,1)
learning\_rate
iterations = no of iterations

Returns the final theta vector and array of cost history over no of iterations
'''

m = len(y)
# learning\_rate = 0.01
# iterations = 100

cost\_history = np.zeros(iterations)
theta\_history = np.zeros((iterations, 2))
for i in range(iterations):
prediction = np.dot(X, theta)

theta = theta - (1/m) \* learning\_rate \* (X.T.dot((prediction - y)))
theta\_history\[i, :\] = theta.T
cost\_history\[i\] = cal\_cost(theta, X, y)

return theta, cost\_history, theta\_history

|
|

1
2
3
4
5
6
7
8
9
10
1  
2
3
4
5
6
7
8
9

|

1
2
3
4
5
6
7
8
9
10
\# 从1000次迭代开始,学习率为0.01。从高斯分布的θ开始  
lr =0.01
n\_iter = 1000
theta = np.random.randn(2, 1)
X\_b = np.c\_\[np.ones((len(X), 1)), X\]
theta, cost\_history, theta\_history = gradient\_descent(X\_b, y, theta, lr, n\_iter)

print('Theta0: {:0.3f},\\nTheta1: {:0.3f}'.format(theta\[0\]\[0\],theta\[1\]\[0\]))
print('Final cost/MSE: {:0.3f}'.format(cost\_history\[-1\]))

|

1
2
3
Theta0:          3.867,
Theta1: 3.124
Final cost/MSE: 5457.747

|

1
2
3
4
5
6
7
1  
2
3
4
5
6

|

1
2
3
4
5
6
7
\# 绘制迭代的成本图  
fig, ax = plt.subplots(figsize=(12,8))

ax.set\_ylabel('J(Theta)')
ax.set\_xlabel('Iterations')
ax.plot(range(1000), cost\_history, 'b.')

|

在大约 150 次迭代之后代价函数趋于稳定,因此放大到迭代200,看看曲线

|

1
2
3
1  
2

|

1
2
3
fig, ax = plt.subplots(figsize=(10,8))  
ax.plot(range(200), cost\_history\[:200\], 'b.')

|

值得注意的是,最初成本下降得更快,然后成本降低的收益就不那么多了。 我们可以尝试使用不同的学习速率和迭代组合,并得到不同学习率和迭代的效果会如何。

让我们建立一个函数,它可以显示效果,也可以显示梯度下降实际上是如何工作的。

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1  
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def plot\_GD(n\_iter, lr, ax, ax1=None):  
'''
n\_iter = no of iterations
lr = Learning Rate
ax = Axis to plot the Gradient Descent
ax1 = Axis to plot cost\_history vs Iterations plot
'''

ax.plot(X, y, 'b.')
theta = np.random.randn(2, 1)

tr = 0.1
cost\_history = np.zeros(n\_iter)
for i in range(n\_iter):
pred\_prev = X\_b.dot(theta)
theta, h, \_ = gradient\_descent(X\_b, y, theta, lr, 1)
pred = X\_b.dot(theta)

cost\_history\[i\] = h\[0\]

if ((i % 25 == 0)):
ax.plot(X, pred, 'r-', alpha=tr)
if tr < 0.8:
tr += 0.2

if not ax1 == None:
ax1.plot(range(n\_iter), cost\_history, 'b.')

|
|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
1  
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
\# 绘制不同迭代和学习率组合的图  
fig = plt.figure(figsize=(30,25), dpi=200)
fig.subplots\_adjust(hspace=0.4, wspace=0.4)

it\_lr = \[(2000, 0.001), (500, 0.01), (200, 0.05), (100, 0.1)\]
count = 0
for n\_iter, lr in it\_lr:
count += 1

ax = fig.add\_subplot(4, 2, count)
count += 1

ax1 = fig.add\_subplot(4, 2, count)

ax.set\_title("lr:{}" .format(lr))
ax1.set\_title("Iterations:{}" .format(n\_iter))
plot\_GD(n\_iter, lr, ax, ax1)

|

通过观察发现,以较小的学习速率收集解决方案需要很长时间,而学习速度越大,学习速度越快。

|

1
2
3
1  
2

|

1
2
3
\_, ax = plt.subplots(figsize=(14, 10))  
plot\_GD(100, 0.1, ax)

|

随机梯度下降法(Stochastic Gradient Descent)

随机梯度下降法,其实和批量梯度下降法原理类似,区别在与求梯度时没有用所有的 m$m$ 个样本的数据,而是仅仅选取一个样本 j$j$ 来求梯度。对应的更新公式是:

θi=θi−α(hθ(x(j)0,x(j)1,…x(j)n)−yj)x(j)i

$$θi=θi−α(hθ(x0(j),x1(j),…xn(j))−yj)xi(j)$$

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1  
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def stocashtic\_gradient\_descent(X, y, theta, learning\_rate=0.01, iterations=10):  
'''
X = Matrix of X with added bias units
y = Vector of Y
theta=Vector of thetas np.random.randn(j,1)
learning\_rate
iterations = no of iterations

Returns the final theta vector and array of cost history over no of iterations
'''

m = len(y)
cost\_history = np.zeros(iterations)

for it in range(iterations):
cost = 0.0
for i in range(m):
rand\_ind = np.random.randint(0, m)
X\_i = X\[rand\_ind, :\].reshape(1, X.shape\[1\])
y\_i = y\[rand\_ind, :\].reshape(1, 1)
prediction = np.dot(X\_i, theta)

theta -= (1/m) \* learning\_rate \* (X\_i.T.dot((prediction - y\_i)))
cost += cal\_cost(theta, X\_i, y\_i)
cost\_history\[it\] = cost

return theta, cost\_history

|
|

1
2
3
4
5
6
7
8
9
1  
2
3
4
5
6
7
8

|

1
2
3
4
5
6
7
8
9
lr = 0.5  
n\_iter = 50
theta = np.random.randn(2,1)
X\_b = np.c\_\[np.ones((len(X),1)), X\]
theta, cost\_history = stocashtic\_gradient\_descent(X\_b, y, theta, lr, n\_iter)

print('Theta0: {:0.3f},\\nTheta1: {:0.3f}' .format(theta\[0\]\[0\],theta\[1\]\[0\]))
print('Final cost/MSE: {:0.3f}' .format(cost\_history\[-1\]))

|

1
2
3
Theta0:          3.762,
Theta1: 3.159
Final cost/MSE: 46.964

|

1
2
3
4
5
6
7
8
1  
2
3
4
5
6
7

|

1
2
3
4
5
6
7
8
fig, ax = plt.subplots(figsize=(10,8))  

ax.set\_ylabel('$J(\\Theta)$' ,rotation=0)
ax.set\_xlabel('$Iterations$')
theta = np.random.randn(2,1)

ax.plot(range(n\_iter), cost\_history, 'b.')

|

小批量梯度下降法(Mini-batch Gradient Descent)

小批量梯度下降法是批量梯度下降法和随机梯度下降法的折衷,也就是对于 m$m$ 个样本,我们采用 x$x$ 个样子来迭代,1<x<m$1<x<m$。一般可以取 x=10$x=10$,当然根据样本的数据,可以调整这个 x$x$ 的值。对应的更新公式是:

θi=θi−αt+x−1∑j=t(hθ(x(j)0,x(j)1,…x(j)n)−yj)x(j)i

$$θi=θi−α∑j=tt+x−1(hθ(x0(j),x1(j),…xn(j))−yj)xi(j)$$

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
1  
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

|

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def minibatch\_gradient\_descent(X, y, theta, learning\_rate=0.01, iterations=10, batch\_size=20):  
'''
X = Matrix of X without added bias units
y = Vector of Y
theta=Vector of thetas np.random.randn(j,1)
learning\_rate
iterations = no of iterations

Returns the final theta vector and array of cost history over no of iterations
'''

m = len(y)
cost\_history = np.zeros(iterations)
n\_batches = int(m / batch\_size)

for it in range(iterations):
cost = 0.0
indices = np.random.permutation(m)
X = X\[indices\]
y = y\[indices\]
for i in range(0, m, batch\_size):
X\_i = X\[i: i+batch\_size\]
y\_i = y\[i: i+batch\_size\]

X\_i = np.c\_\[np.ones(len(X\_i)), X\_i\]
prediction = np.dot(X\_i, theta)

theta -= (1/m) \* learning\_rate \* (X\_i.T.dot((prediction - y\_i)))
cost += cal\_cost(theta, X\_i, y\_i)
cost\_history\[it\] = cost

return theta, cost\_history

|
|

1
2
3
4
5
6
7
8
1  
2
3
4
5
6
7

|

1
2
3
4
5
6
7
8
lr = 0.1  
n\_iter = 200
theta = np.random.randn(2, 1)
theta, cost\_history = minibatch\_gradient\_descent(X, y, theta, lr, n\_iter)

print('Theta0: {:0.3f},\\nTheta1: {:0.3f}' .format(theta\[0\]\[0\], theta\[1\]\[0\]))
print('Final cost/MSE: {:0.3f}' .format(cost\_history\[-1\]))

|

1
2
3
Theta0:          3.842,
Theta1: 3.146
Final cost/MSE: 1090.518

|

1
2
3
4
5
6
7
8
1  
2
3
4
5
6
7

|

1
2
3
4
5
6
7
8
fig, ax = plt.subplots(figsize=(10,8))  

ax.set\_ylabel('$J(\\Theta)$', rotation=0)
ax.set\_xlabel('$Iterations$')
theta = np.random.randn(2, 1)

ax.plot(range(n\_iter), cost\_history, 'b.')

|

_参考_:

# Gradient Descent, machine learning, python

作者

laugh12321

发布于

2019-03-09

更新于

2020-10-23

许可协议

CC BY-NC-SA 4.0

评论

Your browser is out-of-date!

Update your browser to view this website correctly.&npsb;Update my browser now

×