Gradient descent assumes that the objective function $f(x)$ is differentiable. In practice, however, we often encounter non-differentiable functions; for such functions a gradient cannot be computed at every point, and their optimal values are frequently attained exactly at the non-differentiable points. To handle this situation, this section introduces the subgradient method.
Assume $f(x)$ is convex but not necessarily differentiable, and consider the problem
$$\min_{x} f(x)$$
To minimize a non-differentiable convex function $f$, we can construct, by analogy with the gradient method, the following subgradient iteration:
$$x^{k+1}=x^{k}-\alpha_{k}g^{k},\qquad g^{k}\in\partial f(x^{k})$$
where the step size $\alpha_{k}>0$ is usually chosen in one of the following four ways.
Consider the LASSO problem:
$$\min\ f(x)=\frac{1}{2}\|Ax-b\|^{2}+u\|x\|_{1}$$
It is easy to see that one subgradient of $f(x)$ is
$$g=A^{T}(Ax-b)+u\,\mathrm{sign}(x)$$
Therefore, the subgradient method for the LASSO problem is
$$x^{k+1}=x^{k}-\alpha_{k}\left(A^{T}(Ax^{k}-b)+u\,\mathrm{sign}(x^{k})\right)$$
In the experiment below, the regularization parameter is $u=1$, and we try the fixed step sizes $\alpha_{k}=0.0005$, $\alpha_{k}=0.0002$, $\alpha_{k}=0.0001$ as well as the diminishing step size $\alpha_{k}=\frac{0.002}{\sqrt{k}}$. Within 3000 iterations, every fixed step size only reaches a suboptimal solution: the objective value decreases to a certain level and then hovers around some value, whereas with the diminishing step size the algorithm eventually converges. Note also that the subgradient method is not a monotone method, so during the run we must record the best iterate found so far and return it as the final output.
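A minimal sketch of this kind of experiment is given below. The data `A`, `b`, the problem size, and the helper name `subgrad_lasso` are all hypothetical, chosen only to illustrate the iteration and the step-size choices; they are not the original test problem.

```python
import numpy as np

# Subgradient method for LASSO: a minimal, self-contained sketch (hypothetical data).
np.random.seed(0)
m, n = 20, 50
A = np.random.randn(m, n)
b = np.random.randn(m)
u = 1.0                            # regularization parameter

def subgrad_lasso(step, iters=3000, diminishing=False):
    x = np.zeros(n)
    f_best = np.inf                # the method is non-monotone: keep the best value seen
    for k in range(1, iters + 1):
        g = A.T @ (A @ x - b) + u * np.sign(x)          # a subgradient of f at x
        alpha = step / np.sqrt(k) if diminishing else step
        x = x - alpha * g
        f = 0.5 * np.linalg.norm(A @ x - b) ** 2 + u * np.linalg.norm(x, 1)
        f_best = min(f_best, f)
    return f_best

for s in [0.0005, 0.0002, 0.0001]:
    print("fixed step", s, "best value", subgrad_lasso(s))
print("diminishing step 0.002/sqrt(k), best value", subgrad_lasso(0.002, diminishing=True))
```

Because the method is non-monotone, the sketch tracks `f_best` rather than the last iterate, as described above.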

For $u=10^{-2}$ and $10^{-3}$, we solve the problem with a continuation subgradient method.
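The exact continuation scheme is not spelled out here, so the following is only a sketch of one common variant, under the assumption that "continuation" means solving a sequence of LASSO problems with a decreasing regularization parameter and warm-starting each stage from the previous solution. It reuses the hypothetical `A`, `b`, and `n` from the previous snippet, and the schedule parameters are likewise assumptions.

```python
# Continuation: start from a large regularization parameter and shrink it towards
# the target value, warm-starting each subproblem from the previous solution.
def subgrad_lasso_warm(x0, u, step, iters):
    x = x0.copy()
    for k in range(1, iters + 1):
        g = A.T @ (A @ x - b) + u * np.sign(x)
        x = x - (step / np.sqrt(k)) * g          # diminishing step size
    return x

def continuation(u_target, u0=1.0, shrink=0.1):
    x = np.zeros(n)
    u = u0
    while u > u_target:
        x = subgrad_lasso_warm(x, u, step=0.002, iters=500)   # easier subproblem
        u = max(u * shrink, u_target)                         # shrink the parameter
    return subgrad_lasso_warm(x, u_target, step=0.002, iters=3000)

x_sol = continuation(1e-3)
```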

For gradient descent, the iteration we use is
$$x^{k+1}=x^{k}-\alpha_{k}\nabla f(x^{k})$$
The basic strategy of gradient descent is to move along the first-order steepest descent direction, so it only uses first-order information; when the condition number of $\nabla^{2}f(x)$ is large, its convergence is slow. If $f(x)$ is sufficiently smooth, we can therefore use second-order information about $f(x)$ to improve the descent direction and accelerate the iteration.
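A small illustration of this claim on an assumed ill-conditioned quadratic (the matrix and step size below are made up for the demonstration):

```python
import numpy as np

# Fixed-step gradient descent on an ill-conditioned quadratic f(x) = 0.5 * x^T H x.
H = np.diag([1.0, 100.0])          # condition number 100
x = np.array([1.0, 1.0])
alpha = 1.0 / 100.0                # step 1/L, where L is the largest eigenvalue
for k in range(200):
    x = x - alpha * (H @ x)        # the gradient of f is H x
print(x)  # the component along the small-curvature direction decays very slowly
```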
Classical Newton's method: suppose $f(x)$ is twice differentiable. At the $k$-th iteration we consider the second-order Taylor approximation of the objective $f$ at the point $x^{k}$:
$$f(x^{k}+d^{k})=f(x^{k})+\nabla f(x^{k})^{T}d^{k}+\frac{1}{2}(d^{k})^{T}\nabla^{2}f(x^{k})d^{k}+o(\|d^{k}\|^{2})$$
Ignoring the higher-order term $o(\|d^{k}\|^{2})$, viewing the right-hand side as a function of $d^{k}$, and minimizing it by setting its gradient to zero gives
$$\nabla^{2}f(x^{k})d^{k}=-\nabla f(x^{k})$$
This equation is called the Newton equation, and $d^{k}$ is called the Newton direction.
If $\nabla^{2}f(x^{k})$ is nonsingular, we obtain the iteration
$$x^{k+1}=x^{k}-\nabla^{2}f(x^{k})^{-1}\nabla f(x^{k})$$
This iteration is called the classical Newton's method.
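A minimal sketch of a single classical Newton step is shown below; the interface names `grad` and `hess` and the quadratic test problem are assumptions for illustration. In practice one solves the Newton equation rather than forming the inverse explicitly.

```python
import numpy as np

# One step of classical Newton's method: solve the Newton equation instead of inverting.
def newton_step(xk, grad, hess):
    dk = np.linalg.solve(hess(xk), -grad(xk))   # solves  ∇²f(xk) dk = -∇f(xk)
    return xk + dk

# Check on a quadratic f(x) = 0.5 x^T H x - b^T x: one step reaches the minimizer H^{-1} b.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x1 = newton_step(np.zeros(2), lambda x: H @ x - b, lambda x: H)
print(np.allclose(x1, np.linalg.solve(H, b)))   # True
```

On a strictly convex quadratic a single Newton step lands exactly on the minimizer, which the final check confirms.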
Classical Newton's method converges very fast, but several factors limit its use.

In practice the iteration $x^{k+1}=x^{k}-\nabla^{2}f(x^{k})^{-1}\nabla f(x^{k})$ can hardly ever be used directly. This is because the Hessian $\nabla^{2}f(x^{k})$ may fail to be positive definite (it may even be singular), in which case the Newton direction may not exist or may not be a descent direction; computing the second derivatives and solving the Newton equation at every step is expensive; and with the fixed unit step the method is only guaranteed to converge when the starting point is already close to the solution.
Modified Newton's method: to overcome these drawbacks, the classical method is modified so that it becomes a practically usable algorithm. Here we present the modified Newton's method with line search. Its basic idea is to modify the Hessian $\nabla^{2}f(x^{k})$ in the Newton equation so that it becomes positive definite, and at the same time to introduce a line search to improve stability. The algorithm framework is as follows.

In the algorithm above, the choice of $B^{k}$ (i.e., of the correction term $E^{k}$, with $B^{k}=\nabla^{2}f(x^{k})+E^{k}$) is crucial.
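One standard construction, sketched below under the assumption that $E^{k}$ is taken as a multiple of the identity (a common textbook choice, not necessarily the exact rule intended here), keeps increasing the shift until a Cholesky factorization of $B^{k}$ succeeds:

```python
import numpy as np

# Hessian modification: add tau * I until B^k = ∇²f(x^k) + tau * I is positive definite,
# using success of the Cholesky factorization as the test.
def modified_hessian(Hk, beta=1e-3, factor=2.0):
    tau = 0.0 if np.all(np.diag(Hk) > 0) else beta - np.min(np.diag(Hk))
    while True:
        try:
            np.linalg.cholesky(Hk + tau * np.eye(Hk.shape[0]))
            return Hk + tau * np.eye(Hk.shape[0])       # positive definite: use as B^k
        except np.linalg.LinAlgError:
            tau = max(factor * tau, beta)               # enlarge tau and try again

Hk = np.array([[1.0, 3.0], [3.0, 1.0]])                 # an indefinite example
print(np.linalg.eigvalsh(modified_hessian(Hk)))         # all eigenvalues positive
```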




```python
import numpy as np
from sympy import *

# Build the gradient (as a list of symbolic expressions)
def cal_grad_funciton(function):
    res = []
    x = [Symbol('x1'), Symbol('x2')]
    for i in range(len(x)):
        res.append(diff(function, x[i]))
    return res

# Define the objective function
def function_define():
    x = [Symbol('x1'), Symbol('x2')]
    # textbook example
    y = 2 * (x[0] - x[1] ** 2) ** 2 + (1 + x[1]) ** 2
    # Rosenbrock function
    # y = 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2
    return y

# Evaluate the objective at x
def cal_function_ret(function, x):
    return function.subs([('x1', x[0]), ('x2', x[1])])

# Evaluate the gradient at x
def cal_grad_ret(grad_function, x):
    res = []
    for i in range(len(x)):
        res.append(grad_function[i].subs([('x1', x[0]), ('x2', x[1])]))
    return res

# Build the Hessian (symbolically)
def Hessian_define(function):
    x = [Symbol('x1'), Symbol('x2')]
    H = zeros(2, 2)
    function = sympify([function])
    for i, fi in enumerate(function):
        for j, r in enumerate(x):
            for k, s in enumerate(x):
                H[j, k] = diff(diff(fi, r), s)
    return np.array(H)

# Evaluate the Hessian at x
def cal_hessian_ret(hessian_function, x):
    res = []
    for i in range(2):
        temp = []
        for j in range(2):
            temp.append(hessian_function[i][j].subs([('x1', x[0]), ('x2', x[1])]))
        res.append(temp)
    return res

# Armijo–Goldstein line search
def armijo_goldstein(function, grad_function, x0, d, c=0.3):
    alpha = 1.0        # initial step size
    gamma = 0.333      # backtracking factor
    k = 1.0            # iteration counter
    # fd = f(x + alpha * d)
    fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))
    # fk = f(x) + c * alpha * grad_f(x)^T d
    fk = cal_function_ret(function, x0) + c * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
    # fp = f(x) + (1 - c) * alpha * grad_f(x)^T d
    fp = cal_function_ret(function, x0) + (1 - c) * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
    while fd > fk or fd < fp:
        alpha *= gamma
        fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))
        fk = cal_function_ret(function, x0) + c * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
        fp = cal_function_ret(function, x0) + (1 - c) * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
        k += 1.0
    print("line search iterations:", k)
    return alpha

# Armijo–Wolfe line search
def armijo_wlofe(function, grad_function, x0, d, c=0.3):
    c2 = 0.9           # Wolfe curvature parameter
    alpha = 1.0        # initial step size
    a, b = 0, np.inf   # bisection interval for alpha
    k = 1.0            # iteration counter
    gk = cal_grad_ret(grad_function, x0)                             # grad_f(x)
    fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))    # f(x + alpha * d)
    fk = cal_function_ret(function, x0) + c * alpha * np.dot(gk, d)  # f(x) + c*alpha*grad_f(x)^T d
    gp = cal_grad_ret(grad_function, np.add(x0, np.dot(alpha, d)))   # grad_f(x + alpha * d)
    while True:
        if fd > fk:                                 # sufficient-decrease condition violated
            b = alpha
            alpha = (a + b) / 2
        elif np.dot(gp, d) < c2 * np.dot(gk, d):    # curvature condition violated
            a = alpha
            alpha = min(2 * alpha, (a + b) / 2)     # fixed: built-in min, not np.min
        else:
            break
        fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))
        fk = cal_function_ret(function, x0) + c * alpha * np.dot(gk, d)
        gp = cal_grad_ret(grad_function, np.add(x0, np.dot(alpha, d)))
        k = k + 1
    return alpha

# Euclidean norm
def norm(x):
    return sqrt(np.dot(x, x))

# Newton's method with line search
def Newton(function, hessian_function, grad_function, x0, delta):
    k = 0          # iteration counter
    xk = x0        # starting point
    gk = cal_grad_ret(grad_function, xk)
    while norm(gk) > 0.00001 and k < 10000:
        gk2 = cal_hessian_ret(hessian_function, xk)   # Hessian at xk
        dk = np.dot(-1, np.dot(np.linalg.inv(np.array(gk2, dtype=float)), gk))  # Newton direction
        alpha = 0.01
        if delta == 0:   # use the Wolfe criterion for the step size
            alpha = armijo_wlofe(function, grad_function, xk, dk)
        if delta == 1:   # use the Goldstein criterion for the step size
            alpha = armijo_goldstein(function, grad_function, xk, dk)
        xk = np.add(xk, np.dot(alpha, dk))            # update the iterate
        k = k + 1
        gk = cal_grad_ret(grad_function, xk)          # update the gradient
    print("xk", xk)
    print("iterations", k)
    return cal_function_ret(function, xk)

if __name__ == '__main__':
    function = function_define()
    grad_function = cal_grad_funciton(function)
    Hessian_function = Hessian_define(function)
    print("function", function)
    print("gradient", grad_function)
    print("Hessian", Hessian_function)
    # objective value at (1, 1)
    print("function value:", cal_function_ret(function, [1, 1]))
    # gradient at (1, 1)
    print("gradient value:", cal_grad_ret(grad_function, [1, 1]))
    # Hessian at (1, 1)
    print("Hessian value:", cal_hessian_ret(Hessian_function, [1, 1]))
    # print("minimum:", Newton(function, Hessian_function, grad_function, [1, 1], 3))
```

Quasi-Newton methods: although Newton's method converges fast, it requires the second-order partial derivatives of the objective, which are expensive to compute, and the Hessian may fail to stay positive definite, which makes Newton's method break down. Quasi-Newton methods therefore replace $H_{t}^{-1}$ in Newton's method with a matrix $U_{t}$ that involves no second derivatives, and then perform a one-dimensional search along the direction $-U_{t}g_{t}$. Different ways of constructing $U_{t}$ under the quasi-Newton condition give rise to different quasi-Newton methods.
Quasi-Newton condition: the approximation to the Hessian (or to its inverse) cannot be chosen arbitrarily; it needs theoretical guidance, and the quasi-Newton condition provides it by specifying what the approximating matrix must satisfy.
Suppose $x_{k+1}$ has been obtained after $k+1$ iterations. Expanding the objective $f(x)$ in a Taylor series around $x_{k+1}$ and keeping terms up to second order gives
$$f(x)\approx f(x_{k+1})+\nabla f(x_{k+1})^{T}(x-x_{k+1})+\frac{1}{2}(x-x_{k+1})^{T}H_{k+1}(x-x_{k+1})$$
where $H_{k+1}=\nabla^{2}f(x_{k+1})$ denotes the Hessian. Applying the gradient operator $\nabla$ to both sides yields
$$\nabla f(x)\approx \nabla f(x_{k+1})+H_{k+1}\cdot(x-x_{k+1})$$
Setting $x=x_{k}$ in this relation and writing $g_{k}=\nabla f(x_{k})$, we obtain after rearranging
$$g_{k+1}-g_{k}\approx H_{k+1}\cdot(x_{k+1}-x_{k})$$
Introducing the notation
$$s_{k}=x_{k+1}-x_{k},\qquad y_{k}=g_{k+1}-g_{k}$$
the relation above can be written as
$$y_{k}\approx H_{k+1}\cdot s_{k}\qquad ①$$
or
$$s_{k}\approx H_{k+1}^{-1}\cdot y_{k}\qquad ②$$
Equations ① and ② are the quasi-Newton conditions; they constrain the Hessian $H_{k+1}$ appearing during the iteration. Consequently, a matrix $B_{k+1}$ approximating $H_{k+1}$, or a matrix $D_{k+1}$ approximating $H_{k+1}^{-1}$, should be constructed so that $y_{k}=B_{k+1}\cdot s_{k}$ or $s_{k}=D_{k+1}\cdot y_{k}$ holds.
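As a quick sanity check of condition ①: for a quadratic objective the relation holds exactly rather than approximately. The matrix, vector, and points in the snippet below are arbitrary values chosen only for the demonstration.

```python
import numpy as np

# For f(x) = 0.5 * x^T H x - b^T x the gradient is H x - b, so
# y_k = g_{k+1} - g_k = H (x_{k+1} - x_k) = H s_k and condition (1) holds exactly.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: H @ x - b

xk  = np.array([0.0, 0.0])
xk1 = np.array([0.5, -0.3])
sk, yk = xk1 - xk, grad(xk1) - grad(xk)
print(np.allclose(yk, H @ sk))   # True: y_k = H s_k
```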
DFP algorithm: DFP is the earliest quasi-Newton method. It approximates $H_{k+1}^{-1}$ with the following update:
$$D_{k+1}=D_{k}+\Delta D_{k},\qquad k=0,1,2,\dots$$
where the correction matrix $\Delta D_{k}$ is
$$\Delta D_{k}=\frac{s_{k}s_{k}^{T}}{s_{k}^{T}y_{k}}-\frac{D_{k}y_{k}y_{k}^{T}D_{k}}{y_{k}^{T}D_{k}y_{k}}$$
The algorithm proceeds as follows.

```python
import numpy as np
import math
from sympy import *

# Build the gradient (as a list of symbolic expressions)
def cal_grad_funciton(function):
    res = []
    x = [Symbol('x1'), Symbol('x2')]
    for i in range(len(x)):
        res.append(diff(function, x[i]))
    return res

# Define the objective function
def function_define():
    x = [Symbol('x1'), Symbol('x2')]
    # textbook example
    y = 2 * (x[0] - x[1] ** 2) ** 2 + (1 + x[1]) ** 2
    # Rosenbrock function
    # y = 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2
    return y

# Evaluate the objective at x
def cal_function_ret(function, x):
    return function.subs([('x1', x[0]), ('x2', x[1])])

# Evaluate the gradient at x
def cal_grad_ret(grad_function, x):
    res = []
    for i in range(len(x)):
        res.append(grad_function[i].subs([('x1', x[0]), ('x2', x[1])]))
    return res

# Build the Hessian (symbolically)
def Hessian_define(function):
    x = [Symbol('x1'), Symbol('x2')]
    H = zeros(2, 2)
    function = sympify([function])
    for i, fi in enumerate(function):
        for j, r in enumerate(x):
            for k, s in enumerate(x):
                H[j, k] = diff(diff(fi, r), s)
    return np.array(H)

# Evaluate the Hessian at x
def cal_hessian_ret(hessian_function, x):
    res = []
    for i in range(2):
        temp = []
        for j in range(2):
            temp.append(hessian_function[i][j].subs([('x1', x[0]), ('x2', x[1])]))
        res.append(temp)
    return res

# Armijo–Goldstein line search
def armijo_goldstein(function, grad_function, x0, d, c=0.3):
    alpha = 1.0        # initial step size
    gamma = 0.333      # backtracking factor
    k = 1.0            # iteration counter
    # fd = f(x + alpha * d)
    fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))
    # fk = f(x) + c * alpha * grad_f(x)^T d
    fk = cal_function_ret(function, x0) + c * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
    # fp = f(x) + (1 - c) * alpha * grad_f(x)^T d
    fp = cal_function_ret(function, x0) + (1 - c) * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
    while fd > fk or fd < fp:
        alpha *= gamma
        fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))
        fk = cal_function_ret(function, x0) + c * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
        fp = cal_function_ret(function, x0) + (1 - c) * alpha * np.dot(cal_grad_ret(grad_function, x0), d)
        k += 1.0
    print("line search iterations:", k)
    return alpha

# Armijo–Wolfe line search
def armijo_wlofe(function, grad_function, x0, d, c=0.3):
    c2 = 0.9           # Wolfe curvature parameter
    alpha = 1.0        # initial step size
    a, b = 0, np.inf   # bisection interval for alpha
    k = 1.0            # iteration counter
    gk = cal_grad_ret(grad_function, x0)                             # grad_f(x)
    fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))    # f(x + alpha * d)
    fk = cal_function_ret(function, x0) + c * alpha * np.dot(gk, d)  # f(x) + c*alpha*grad_f(x)^T d
    gp = cal_grad_ret(grad_function, np.add(x0, np.dot(alpha, d)))   # grad_f(x + alpha * d)
    while True:
        if fd > fk:                                 # sufficient-decrease condition violated
            b = alpha
            alpha = (a + b) / 2
        elif np.dot(gp, d) < c2 * np.dot(gk, d):    # curvature condition violated
            a = alpha
            alpha = min(2 * alpha, (a + b) / 2)     # fixed: built-in min, not np.min
        else:
            break
        fd = cal_function_ret(function, np.add(x0, np.dot(alpha, d)))
        fk = cal_function_ret(function, x0) + c * alpha * np.dot(gk, d)
        gp = cal_grad_ret(grad_function, np.add(x0, np.dot(alpha, d)))
        k = k + 1
    return alpha

# Euclidean norm
def norm(x):
    return math.sqrt(np.dot(x, x))

# Newton's method with line search
def Newton(function, hessian_function, grad_function, x0, delta):
    k = 0          # iteration counter
    xk = x0        # starting point
    gk = cal_grad_ret(grad_function, xk)
    while norm(gk) > 0.0001 and k < 10000:
        gk2 = cal_hessian_ret(hessian_function, xk)   # Hessian at xk
        dk = np.dot(-1, np.dot(np.linalg.inv(np.array(gk2, dtype=float)), gk))  # Newton direction
        alpha = 0.01
        if delta == 0:   # use the Wolfe criterion for the step size
            alpha = armijo_wlofe(function, grad_function, xk, dk)
        if delta == 1:   # use the Goldstein criterion for the step size
            alpha = armijo_goldstein(function, grad_function, xk, dk)
        xk = np.add(xk, np.dot(alpha, dk))            # update the iterate
        k = k + 1
        gk = cal_grad_ret(grad_function, xk)          # update the gradient
    print("xk", xk)
    print("iterations", k)
    return cal_function_ret(function, xk)

# Quasi-Newton method (DFP)
def DFP_Newton(function, grad_function, x0):
    xk = x0
    alpha = 0.001                        # fixed step size
    gk = cal_grad_ret(grad_function, xk)
    D = np.eye(2)                        # initial positive definite matrix
    k = 1
    while k < 10000:
        dk = np.dot(-1, np.dot(D, gk))                 # search direction
        xk_new = np.add(xk, np.dot(alpha, dk))
        gk_new = cal_grad_ret(grad_function, xk_new)   # new gradient
        if norm(gk_new) < 0.0001:
            break
        # compute y and s
        y = np.array(gk_new, dtype=float) - np.array(gk, dtype=float)
        s = np.array(xk_new, dtype=float) - np.array(xk, dtype=float)
        # DFP update of D (fixed: use outer products; np.dot(s, s.T) on 1-D arrays is a scalar)
        D = D + np.outer(s, s) / np.dot(s, y) - D @ np.outer(y, y) @ D / np.dot(y, np.dot(D, y))
        k = k + 1
        gk = gk_new
        xk = xk_new                      # update the iterate
    print("xk", xk)
    print("iterations", k)
    return cal_function_ret(function, xk)

if __name__ == '__main__':
    function = function_define()
    grad_function = cal_grad_funciton(function)
    Hessian_function = Hessian_define(function)
    print("function", function)
    print("gradient", grad_function)
    print("Hessian", Hessian_function)
    # objective value at (1, 1)
    print("function value:", cal_function_ret(function, [1, 1]))
    # gradient at (1, 1)
    print("gradient value:", cal_grad_ret(grad_function, [1, 1]))
    # Hessian at (1, 1)
    print("Hessian value:", cal_hessian_ret(Hessian_function, [1, 1]))
    # print("minimum:", DFP_Newton(function, grad_function, [1, 1]))
```
