Gradient Descent and Least Squares


Preliminaries

Taylor formula for a function of one variable

$$f(x+\Delta x)=f(x)+f'(x)\Delta x+\frac{1}{2}f''(x)\Delta x^{2}+\cdots$$

Taylor expansion for a multivariate function

$$f(\overrightarrow{x}+\Delta\overrightarrow{x})=f(\overrightarrow{x})+[\nabla f(\overrightarrow{x})]^{T}\Delta\overrightarrow{x}+\frac{1}{2}\Delta\overrightarrow{x}^{T}\nabla^{2}f(\overrightarrow{x})\Delta\overrightarrow{x}+\cdots$$

where $[\nabla f(\overrightarrow{x})]^{T}=\left[\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}},\cdots,\frac{\partial f}{\partial x_{d}}\right]$ and

$$\nabla^{2}f(\overrightarrow{x})=\begin{bmatrix}\frac{\partial^{2}f}{\partial x_{1}\partial x_{1}} & \frac{\partial^{2}f}{\partial x_{2}\partial x_{1}} & \cdots & \frac{\partial^{2}f}{\partial x_{d}\partial x_{1}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^{2}f}{\partial x_{1}\partial x_{d}} & \cdots & \cdots & \frac{\partial^{2}f}{\partial x_{d}\partial x_{d}}\end{bmatrix}$$
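Since the expansion above is exact at second order for a quadratic function, it is easy to verify numerically. A small sketch (the test function is an illustrative choice):

```python
import numpy as np

def f(x):
    # Example quadratic: f(x) = x1^2 + 3*x1*x2 + 2*x2^2
    return x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2

x0 = np.array([1.0, 2.0])
grad = np.array([2*x0[0] + 3*x0[1], 3*x0[0] + 4*x0[1]])  # gradient at x0
H = np.array([[2.0, 3.0],
              [3.0, 4.0]])                               # Hessian (constant here)

# The second-order Taylor expansion is exact for a quadratic
dx = np.array([0.1, -0.2])
taylor = f(x0) + grad @ dx + 0.5 * dx @ H @ dx
assert np.isclose(taylor, f(x0 + dx))
```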

Extremum (local minimum)

$$\nabla f(\overrightarrow{x})=0,\qquad \nabla^{2}f(\overrightarrow{x})>0$$

Here a symmetric matrix satisfying $\nabla^{2}f(\overrightarrow{x})>0$ is called positive definite; equivalently, all of its eigenvalues are positive, or all of its leading principal minors are positive.
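Both conditions are easy to check numerically; a minimal sketch (the example matrix is an arbitrary choice):

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric example matrix

# Condition 1: all eigenvalues are positive
print(np.all(np.linalg.eigvalsh(H) > 0))  # True

# Condition 2: all leading principal minors are positive
minors = [np.linalg.det(H[:k, :k]) for k in range(1, H.shape[0] + 1)]
print(all(m > 0 for m in minors))  # True
```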

Gradient descent

Principle

To find a minimum, we want $f(\overrightarrow{x}+\Delta\overrightarrow{x})-f(\overrightarrow{x})\approx[\nabla f(\overrightarrow{x})]^{T}\Delta\overrightarrow{x}<0$, which is guaranteed by taking $\Delta\overrightarrow{x}=-\alpha\nabla f(\overrightarrow{x})$.

Here $\alpha>0$ is kept small so that the first-order Taylor expansion remains valid in a neighborhood of $\overrightarrow{x}$.

Steps

1. Choose an initial point $\overrightarrow{x_{i}}$ with $i=1$.
2. Compute $\nabla f(\overrightarrow{x_{i}})$ and set $\Delta\overrightarrow{x_{i}}=-\nabla f(\overrightarrow{x_{i}})$.
3. Update $\overrightarrow{x_{i+1}}=\overrightarrow{x_{i}}+\alpha\Delta\overrightarrow{x_{i}}$.
4. Check whether $\parallel f(\overrightarrow{x_{i+1}})-f(\overrightarrow{x_{i}})\parallel_{2}^{2}\leq\varepsilon$; if so, stop; otherwise set $i=i+1$ and repeat steps 2-4 (see the sketch below).
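A minimal Python sketch of these steps (the quadratic objective and the values of $\alpha$ and $\varepsilon$ are illustrative assumptions):

```python
import numpy as np

def f(x):
    # Illustrative objective: f(x) = x1^2 + 2*x2^2, minimized at (0, 0)
    return x[0]**2 + 2*x[1]**2

def grad_f(x):
    # Analytic gradient of the objective above
    return np.array([2*x[0], 4*x[1]])

alpha, eps = 0.1, 1e-12    # step size and tolerance (assumed values)
x = np.array([3.0, -2.0])  # step 1: initial point

while True:
    dx = -grad_f(x)                   # step 2: descent direction
    x_next = x + alpha * dx           # step 3: update
    if (f(x_next) - f(x))**2 <= eps:  # step 4: stopping test
        break
    x = x_next

print(x_next)  # close to the minimizer (0, 0)
```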

Least squares

Steps

Given $n$ data pairs $\{\overrightarrow{x_{i}},y_{i}\}$, $i=1,2,\cdots,n$, where each $\overrightarrow{x_{i}}$ is a $d$-dimensional row vector.

Introduce coefficients $\{\beta_{0},\beta_{1},\cdots,\beta_{d}\}$; the residual for each sample is $\varepsilon_{i}=y_{i}-(\overrightarrow{x_{i}}\overrightarrow{\beta}+\beta_{0})$, $i=1,2,\cdots,n$, where $\overrightarrow{\beta}=\{\beta_{1},\beta_{2},\cdots,\beta_{d}\}^{T}$.

The loss function is:

$$J(\overrightarrow{\beta},\beta_{0})=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-(\overrightarrow{x_{i}}\overrightarrow{\beta}+\beta_{0})\right)^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\Big(\sum_{j=1}^{d}x_{ij}\beta_{j}+\beta_{0}\Big)\right)^{2}$$

Let $x_{i0}=1$ $(i=1,2,\cdots,n)$ and absorb the intercept into $\overrightarrow{\beta}=\{\beta_{0},\beta_{1},\cdots,\beta_{d}\}^{T}$; the loss can then be written as $J(\overrightarrow{\beta})=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\overrightarrow{x_{i}}\overrightarrow{\beta}\right)^{2}$.

Further, let
$$X=\begin{bmatrix}1 & x_{11} & \cdots & x_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \cdots & x_{nd}\end{bmatrix},\qquad Y=[y_{1},y_{2},\cdots,y_{n}]^{T},\qquad B=[\beta_{0},\beta_{1},\cdots,\beta_{d}]^{T}$$

$$J(B)=\frac{1}{n}(Y-XB)^{T}(Y-XB)$$
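A quick numerical check, on randomly generated data, that the matrix form agrees with the summation form (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # leading column of ones
Y = rng.normal(size=n)
B = rng.normal(size=d + 1)

# Matrix form: J(B) = (1/n) (Y - XB)^T (Y - XB)
r = Y - X @ B
J_matrix = r @ r / n

# Summation form: (1/n) * sum_i (y_i - x_i B)^2
J_sum = sum((Y[i] - X[i] @ B)**2 for i in range(n)) / n

assert np.isclose(J_matrix, J_sum)
```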

Solving with gradient descent

Following the Matrix Cookbook, expand the matrix expression and differentiate:

$$J(B)=\frac{1}{n}(Y-XB)^{T}(Y-XB)=\frac{1}{n}\left(Y^{T}Y-Y^{T}XB-B^{T}X^{T}Y+B^{T}X^{T}XB\right)$$

$$\nabla J(B)=\frac{1}{n}\left(-X^{T}Y-X^{T}Y+2X^{T}XB\right)=\frac{1}{n}\left(-2X^{T}Y+2X^{T}XB\right)$$

Setting the gradient to zero gives $-2X^{T}Y+2X^{T}XB=0$, hence the closed-form solution $B=(X^{T}X)^{-1}X^{T}Y$.
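A minimal sketch of the closed-form solution on synthetic data (solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
true_B = np.array([1.0, 2.0, -3.0])          # assumed "true" coefficients
Y = X @ true_B + 0.01 * rng.normal(size=n)   # noisy observations

# Solve (X^T X) B = X^T Y rather than computing (X^T X)^{-1} directly
B = np.linalg.solve(X.T @ X, X.T @ Y)
print(B)  # close to [1, 2, -3]
```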

Alternatively, iterate:

1. Update $B_{i+1}=B_{i}-\alpha\nabla J(B_{i})$.
2. Check whether $\parallel J(B_{i+1})-J(B_{i})\parallel_{2}^{2}\leq\varepsilon$; if so, stop; otherwise set $i=i+1$ and repeat (see the loop below).
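The same loop in Python, reusing X and Y from the snippet above ($\alpha$ and $\varepsilon$ are assumed values):

```python
alpha, eps = 0.05, 1e-16
B = np.zeros(X.shape[1])  # initial guess

def J(B):
    r = Y - X @ B
    return r @ r / len(Y)

while True:
    grad = (2 / len(Y)) * (X.T @ X @ B - X.T @ Y)  # gradient of J at B
    B_next = B - alpha * grad                      # step 1: update
    if (J(B_next) - J(B))**2 <= eps:               # step 2: stopping test
        break
    B = B_next

print(B_next)  # agrees with the closed-form solution above
```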

Appendix: gradient formulas for a few special cases

- $\nabla(b^{T}X)=b$
- $\nabla(X^{T}b)=b$
- $\nabla(X^{T}X)=2X$
- $\nabla(X^{T}AX)=2AX$ (where $A$ is a symmetric matrix)
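These identities can be spot-checked with a central-difference numerical gradient; a small sketch (the vector and matrix are random illustrative data, with lowercase x playing the role of the variable $X$ above):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
x = rng.normal(size=d)
b = rng.normal(size=d)
A = rng.normal(size=(d, d))
A = (A + A.T) / 2  # make A symmetric

def num_grad(f, x, h=1e-6):
    # Central-difference approximation of the gradient
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x); e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

assert np.allclose(num_grad(lambda v: b @ v, x), b)              # grad(b^T x) = b
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x)          # grad(x^T x) = 2x
assert np.allclose(num_grad(lambda v: v @ A @ v, x), 2 * A @ x)  # grad(x^T A x) = 2Ax
```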

Stochastic gradient descent (SGD)

Pseudocode

```python
network = TwoLayerNet(...)
optimizer = SGD()

for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch(...)  # draw a mini-batch
    grads = network.gradient(x_batch, t_batch)
    params = network.params
    optimizer.update(params, grads)
    ...
```
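For reference, a minimal SGD optimizer matching the interface used above (this sketch assumes params and grads are dicts of NumPy arrays; TwoLayerNet and get_mini_batch are left undefined, as in the pseudocode):

```python
class SGD:
    """Vanilla stochastic gradient descent: W <- W - lr * dW."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params:
            params[key] -= self.lr * grads[key]  # in-place parameter update
```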