一元函数泰勒公式
f ( x + Δ x ) = f ( x ) + f ′ ( x ) Δ x + 1 2 f ′ ′ ( x ) Δ x 2 + . . . f(x+\Delta x)=f(x)+{f}'(x)\Delta x+\frac{1}{2}{{f}}''(x)\Delta x^{2}+... f(x+Δx)=f(x)+f′(x)Δx+21f′′(x)Δx2+...
多元函数泰勒展开
f ( x → + Δ x → ) = f ( x → ) + [ ∇ f ( x → ) ] T Δ x → + 1 2 Δ x → T ∇ 2 f ( x → ) Δ x → + . . . f(\overrightarrow{x}+\Delta{\overrightarrow{x}})=f(\overrightarrow{x})+[\nabla f(\overrightarrow{x})]^{T}\Delta \overrightarrow{x}+\frac{1}{2}\Delta \overrightarrow{x}^{T} \nabla ^{2}f(\overrightarrow{x})\Delta \overrightarrow{x}+... f(x +Δx )=f(x )+[∇f(x )]TΔx +21Δx T∇2f(x )Δx +...
其中, [ ∇ f ( x → ) ] T = [ ∂ f ∂ x 1 , ∂ f ∂ x 2 . . . ∂ f ∂ x d ] , ∇ 2 f ( x → ) = [ ∂ 2 f ∂ x 1 ∂ x 1 ∂ 2 f ∂ x 2 ∂ x 1 ⋯ ∂ 2 f ∂ x d ∂ x 1 ⋮ ⋮ ⋱ ⋮ ∂ 2 f ∂ x 1 ∂ x d ⋯ ⋯ ∂ 2 f ∂ x d ∂ x d ] [\nabla f(\overrightarrow{x})]^{T}=[\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}}...\frac{\partial f}{\partial x_{d}}],\nabla ^{2}f(\overrightarrow{x})=\begin{bmatrix} \frac{\partial^{2} f}{\partial x _{1}\partial x _{1}}& \frac{\partial^{2} f}{\partial x_{2}\partial x _{1}} &\cdots &\frac{\partial^{2} f}{\partial x_{d}\partial x_{1}} \\ \vdots & \vdots &\ddots &\vdots \\ \frac{\partial^{2} f}{\partial x _{1}\partial x _{d}}& \cdots & \cdots & \frac{\partial^{2} f}{\partial x _{d}\partial x _{d}} \end{bmatrix} [∇f(x )]T=[∂x1∂f,∂x2∂f...∂xd∂f],∇2f(x )=⎣⎢⎢⎡∂x1∂x1∂2f⋮∂x1∂xd∂2f∂x2∂x1∂2f⋮⋯⋯⋱⋯∂xd∂x1∂2f⋮∂xd∂xd∂2f⎦⎥⎥⎤
极值点(极小值)
∇ f ( x → ) = 0 , ∇ 2 f ( x → ) > 0 \nabla f(\overrightarrow{x})=0 ,\nabla ^{2}f(\overrightarrow{x})>0 ∇f(x )=0,∇2f(x )>0
其中,满足 ∇ 2 f ( x → ) > 0 \nabla ^{2}f(\overrightarrow{x})>0 ∇2f(x )>0的对称矩阵称为正定矩阵,充要条件为特征值大于零或者各阶主子式大于零
原理
求极小值,为保证: f ( x → + Δ x → ) − f ( x → ) = [ ∇ f ( x → ) ] T Δ x → < 0 f(\overrightarrow{x}+\Delta{\overrightarrow{x}})-f(\overrightarrow{x})=[\nabla f(\overrightarrow{x})]^{T}\Delta \overrightarrow{x}<0 f(x +Δx )−f(x )=[∇f(x )]TΔx <0,取: Δ x → = − α ∇ f ( x → ) \Delta{\overrightarrow{x}}=-\alpha \nabla f(\overrightarrow{x}) Δx =−α∇f(x )
为保证泰勒展开在领域内成立的条件,取: Δ x → = − α ∇ f ( x → ) \Delta{\overrightarrow{x}}=-\alpha \nabla f(\overrightarrow{x}) Δx =−α∇f(x )
步骤
取初始值 x i , i = ( 1 , 2 , ⋯ , n ) x_{i},i=(1,2,\cdots,n) xi,i=(1,2,⋯,n)求 ∇ f ( x i → ) , Δ x i → = − ∇ f ( x i → ) \nabla f(\overrightarrow{x_{i}}),\Delta{\overrightarrow{x_{i}}}=-\nabla f(\overrightarrow{x_{i}}) ∇f(xi ),Δxi =−∇f(xi )取 x i + 1 → = x i → + α Δ x → \overrightarrow{x_{i+1}}=\overrightarrow{x_{i}}+\alpha \Delta{\overrightarrow{x}} xi+1 =xi +αΔx 计算 ∥ f ( x i + 1 → ) − f ( x i → ) ∥ 2 2 ≤ ε \parallel f(\overrightarrow{x_{i+1}})-f(\overrightarrow{x_{i}})\parallel_{2}^{2}\leq\varepsilon ∥f(xi+1 )−f(xi )∥22≤ε,若不等式成立则停止,否则 i = i + 1 i=i+1 i=i+1,重复2,3,4步骤
有n个数据对 { x i → , y i } \{\overrightarrow{x_{i}},y_{i}\} {xi ,yi},其中 x i → \overrightarrow{x_{i}} xi 是行向量( i = 1 , 2 , 3 ⋯ , d i=1,2,3\cdots,d i=1,2,3⋯,d)
构造常数项 { β 0 , β 1 , ⋯ β d } \{\beta _{0},\beta _{1},\cdots\beta _{d} \} {β0,β1,⋯βd};误差计算公式为: ε = y i − ( x i → β → + β 0 ) , i = 1 , 2 , 3 , ⋯ , d \varepsilon =y_{i}-(\overrightarrow{x_{i}}\overrightarrow{\beta}+\beta _{0}),i=1,2,3,\cdots,d ε=yi−(xi β +β0),i=1,2,3,⋯,d;其中 β → = { β 1 , β 2 , ⋯ β d } T \overrightarrow{\beta }=\{\beta _{1},\beta _{2},\cdots\beta _{d} \}^{T} β ={β1,β2,⋯βd}T
损失函数为:
J ( β → , β 0 ) = 1 n ∑ i = 1 n ε 2 = 1 n ∑ i = 1 n ( y i − ( x i → β → + β 0 ) ) 2 = 1 n ∑ i = 1 n ( y i − ( ∑ j = 1 d x i j β j + β 0 ) ) 2 J(\overrightarrow{\beta},\beta_{0})=\frac{1}{n}\sum_{i=1}^{n}\varepsilon^{2}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-(\overrightarrow{x_{i}}\overrightarrow{\beta}+\beta _{0}))^{2} =\frac{1}{n}\sum_{i=1}^{n}(y_{i}-(\sum_{j=1}^{d}x_{ij}\beta _{j} +\beta _{0}))^{2} J(β ,β0)=n1∑i=1nε2=n1∑i=1n(yi−(xi β +β0))2=n1∑i=1n(yi−(∑j=1dxijβj+β0))2
令 x i 0 = 1 ( i = 1 , 2 , ⋯ , n ) , β → = { β 0 , β 1 , ⋯ β d } T x_{i0}=1(i=1,2,\cdots,n),\overrightarrow{\beta }=\{\beta _{0},\beta _{1},\cdots\beta _{d} \}^{T} xi0=1(i=1,2,⋯,n),β ={β0,β1,⋯βd}T,则可以写成 J ( β → ) = 1 n ( ∑ i = 1 n ( y i − x i → β → ) ) 2 J(\overrightarrow{\beta})=\frac{1}{n}(\sum_{i=1}^{n}(y_{i}-\overrightarrow{x_{i}}\overrightarrow{\beta}))^{2} J(β )=n1(i=1∑n(yi−xi β ))2
进一步,令 X = [ 1 x 11 ⋯ x 1 d ⋮ ⋮ ⋱ 1 x n 1 ⋯ x n d ] X=\begin{bmatrix}1 & x_{11} &\cdots&x_{1d}\\\vdots& \vdots& \ddots& \\1& x_{n1}&\cdots&x_{nd}\end{bmatrix} X=⎣⎢⎡1⋮1x11⋮xn1⋯⋱⋯x1dxnd⎦⎥⎤, Y = [ y 1 , y 2 , ⋯ , y n ] T Y=\begin{bmatrix}y_{1},y_{2},\cdots,y_{n}\\\end{bmatrix}^{T} Y=[y1,y2,⋯,yn]T, B = [ β 0 , β 1 , ⋯ , β d ] T B=[\beta_{0},\beta_{1},\cdots,\beta_{d}]^{T} B=[β0,β1,⋯,βd]T
J ( B ) = 1 n ( Y − X B ) T ( Y − X B ) J(B)=\frac{1}{n}(Y-XB)^{T}(Y-XB) J(B)=n1(Y−XB)T(Y−XB)
利用梯度下降法求解
参考matrix cookbook,对矩阵进行展开求导J ( B ) = 1 n ( Y − X B ) T ( Y − X B ) = 1 n ( Y T Y − Y T X B − B T X T Y + B T X T X B ) J(B)=\frac{1}{n}(Y-XB)^{T}(Y-XB)=\frac{1}{n}(Y^{T}Y-Y^{T}XB-B^{T}X^{T}Y+B^{T}X^{T}XB) J(B)=n1(Y−XB)T(Y−XB)=n1(YTY−YTXB−BTXTY+BTXTXB)
∇ J ( B ) = 1 n ( − X T Y − X T Y + 2 X T X B ) = 1 n ( − 2 X T Y + 2 X T X B ) \nabla J(B)=\frac{1}{n}(-X^{T}Y-X^{T}Y+2X^{T}XB)=\frac{1}{n}(-2X^{T}Y+2X^{T}XB) ∇J(B)=n1(−XTY−XTY+2XTXB)=n1(−2XTY+2XTXB)
令梯度为零,则有: − 2 X T Y + 2 X T X B = 0 , B = ( X T X ) − 1 X T Y -2X^{T}Y+2X^{T}XB=0,B=(X^{T}X)^{-1}X^{T}Y −2XTY+2XTXB=0,B=(XTX)−1XTY
取 B i + 1 = B i − α ∇ J ( B i ) B_{i+1}={B_{i}}-\alpha \nabla J(B_{i}) Bi+1=Bi−α∇J(Bi)计算 ∥ J ( B i + 1 ) − J ( B i ) ∥ 2 2 ≤ ε \parallel J(B_{i+1})-J(B_{i})\parallel_{2}^{2}\leq\varepsilon ∥J(Bi+1)−J(Bi)∥22≤ε,若不等式成立则停止,否则 i = i + 1 i=i+1 i=i+1,重复1,2,3附注:几类特殊函数的梯度公式
∇ ( b T X ) = b \nabla (b^{T}X)=b ∇(bTX)=b ∇ ( X T b ) = b \nabla (X^{T}b)=b ∇(XTb)=b ∇ ( X T X ) = 2 X \nabla (X^{T}X)=2X ∇(XTX)=2X ∇ ( X T A X ) = 2 A X \nabla (X^{T}AX)=2AX ∇(XTAX)=2AX(其中A为对称矩阵)伪代码
network = TwoLayerNet(...) optimizer = SGD() for i in range(10000): ... x_batch, t_batch = get_mini_batch(...) # mini-batch grads = network.gradient(x_batch, t_batch) params = network.params optimizer.update(params, grads) ...