Two families of models:

Generative: P(X, Y)
Unlike the notion of "generation" in NLG, here Y may be a latent variable, a regression value, or a class label. Advantage: very expressive; drawback: high cost. Supports x→y, y→x, and sampling from the joint (x, y). Examples: Naive Bayes, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Bayesian networks, Sigmoid Belief Networks, Deep Belief Networks (DBN).

Discriminative: P(Y|X)
Advantage: goal-directed, low cost; drawback: solves only the single task x→y. Examples: linear regression / logistic regression (Logistic Regression), K-nearest neighbors (KNN), the perceptron, neural networks (NN), support vector machines (SVM), decision trees, the maximum entropy model (MaxEnt), Gaussian processes, conditional random fields (CRF), boosting methods.
Markov chain: a special stochastic process whose randomness depends only on the current state, independent of the states already visited and of those yet to come. Hidden Markov chain: a Markov process whose evolving state is hidden and discrete. Hidden Markov Model (HMM): a statistical model describing a Markov process with hidden, unknown parameters. The hard part is determining the hidden parameters of the process from the observable ones, and then using those parameters for further analysis such as pattern recognition.
Statistical model: a family of mathematical models embodying a set of assumptions about sample data; it usually represents the data-generating process in a rather idealized form. Markov process (Markov Process): a class of stochastic processes, and an important tool for studying the state space of discrete-event dynamic systems; its mathematical foundation is the theory of stochastic processes. An example:
Suppose I have three different dice. The first has 6 faces (call it D6), each face (1–6) appearing with probability 1/6. The second is a tetrahedron (call it D4), each face (1–4) appearing with probability 1/4. The third has eight faces (call it D8), each face (1–8) appearing with probability 1/8.
We start rolling: 1) pick one of the three dice, each with probability 1/3; 2) roll it and obtain a number from 1–8. Repeating this process over and over yields a string of numbers, each one of 1–8.
The string of visible numbers obtained this way is called the visible state chain (10 rolls): 1 6 3 5 2 7 3 5 2 4. There is also a hidden state chain: the sequence of dice actually used, e.g. D6 D8 D8 D6 D4 D8 D6 D6 D4 D8.
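To make the dice story concrete, here is a minimal simulation sketch in Python/NumPy. All names (`roll`, `states`, `faces`) are illustrative; picking a new die uniformly at every step is the assumption stated above, which makes the choice of die a Markov chain.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["D6", "D4", "D8"]          # hidden states: which die is in hand
faces = {"D6": 6, "D4": 4, "D8": 8}  # each die emits one of its faces uniformly

pi = np.array([1/3, 1/3, 1/3])       # initial distribution over the dice
A = np.full((3, 3), 1/3)             # transition matrix: re-pick uniformly

def roll(T=10):
    """Generate a hidden state chain and a visible state chain of length T."""
    hidden, visible = [], []
    s = rng.choice(3, p=pi)
    for _ in range(T):
        hidden.append(states[s])
        visible.append(int(rng.integers(1, faces[states[s]] + 1)))
        s = rng.choice(3, p=A[s])    # pick the die for the next roll
    return hidden, visible

hidden, visible = roll()
print(hidden)   # e.g. ['D6', 'D8', ...]  -- the hidden state chain
print(visible)  # e.g. [1, 6, 3, ...]     -- the visible state chain
```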
Evaluation (probability computation; forward and backward algorithms): given $\lambda$, compute $P(O|\lambda)$. Learning (EM algorithm): given an observation sequence $O$, use MLE to find the $\lambda$ that makes $O$ most probable: $\lambda_{MLE}=\underset{\lambda}{\operatorname{argmax}}\,P(O|\lambda)$. Decoding (Viterbi algorithm): given the observation sequence and the parameters $\lambda$, find the most probable hidden state sequence: $H=\underset{H}{\operatorname{argmax}}\,P(H|O,\lambda)$.
Evaluation: given $\lambda$, compute $P(O|\lambda)$ — evaluate the probability of an observation sequence $O$ that has already occurred, which also tells us whether the model parameters fit. (Dice version: I know how many kinds of dice there are (number of hidden states) and what each die is (the probabilities); given the rolled results (visible state chain), I want the probability of rolling exactly this result.)

Learning: $\lambda_{MLE}=\underset{\lambda}{\operatorname{argmax}}\,P(O|\lambda)$ — given an observed sequence $O$, find the parameters $\lambda$ that maximize its probability, using the EM algorithm. (Dice version: I know how many kinds of dice there are (number of hidden states) but not what each die is; from many observed rolls (visible state chain) I want to infer what each die is (the model probabilities).)

Decoding: $\hat{H}=\underset{H}{\operatorname{argmax}}\,P(H|O;\lambda)$ — given the observations and the parameters, recover the hidden sequence under which the observed facts are most probable, using the Viterbi algorithm (dynamic programming; brute-force enumeration is discarded). (Dice version: I know how many kinds of dice there are and what each die is; from the rolled results I want to know which die was used at each roll (hidden state chain).)
We now derive the solution to the first problem. Given $\lambda=(\pi,A,B)$, compute $P(O|\lambda)$.
$$P(O|\lambda)=\sum_{S}^{H}P(O,S|\lambda)=\sum_{S}^{H}P(O|S;\lambda)P(S|\lambda) \tag{1}$$

Writing $S=s_1 s_2 \dots s_T$ and substituting:

$$P(S|\lambda)=P(s_1 s_2 \dots s_T|\lambda)=P(s_T|s_1 s_2 \dots s_{T-1};\lambda)\,P(s_1 s_2 \dots s_{T-1}|\lambda)$$

Iterating the same step down from $T-1$:

$$P(s_1 \dots s_{T-1}|\lambda)=P(s_{T-1}|s_1 \dots s_{T-2};\lambda)\,P(s_1 \dots s_{T-2}|\lambda)$$
$$P(s_1 \dots s_{T-2}|\lambda)=P(s_{T-2}|s_1 \dots s_{T-3};\lambda)\,P(s_1 \dots s_{T-3}|\lambda)$$
$$\dots$$
$$P(s_1 s_2|\lambda)=P(s_2|s_1;\lambda)\,P(s_1|\lambda)$$

By the homogeneous Markov assumption (the current state depends only on the immediately preceding state, not on earlier states or on the observations), $P(s_{t+1}|s_1 s_2 \dots s_t; o_1 o_2 \dots o_t)=P(s_{t+1}|s_t)$. Substituting each equation back into the one before it gives
$$\begin{aligned}
P(S|\lambda)=P(s_1 s_2 \dots s_T|\lambda)&=P(s_T|s_1 \dots s_{T-1};\lambda)\,P(s_{T-1}|s_1 \dots s_{T-2};\lambda)\cdots P(s_2|s_1;\lambda)\,P(s_1|\lambda)\\
&=P(s_T|s_{T-1};\lambda)\,P(s_{T-1}|s_{T-2};\lambda)\cdots P(s_2|s_1;\lambda)\,P(s_1|\lambda)\\
&=\prod_{t=2}^{T}P(s_t|s_{t-1};\lambda)\;P(s_1|\lambda),\qquad s_t\in H
\end{aligned}$$
$$=\pi(s_1)\prod_{t=2}^{T}a_{s_{t-1}s_t},\qquad s_t\in H \tag{2}$$

By definition, $P(O|S;\lambda)$ is the probability of the observed values given the parameters $\lambda$ and the hidden state sequence $S$, which factorizes directly:

$$P(O|S;\lambda)=\prod_{t=1}^{T}b_{s_t\to o_t},\qquad s_t\in H,\;o_t\in R \tag{3}$$
Therefore

$$P(O|\lambda)=\underbrace{\sum_{s_1}^{H}\sum_{s_2}^{H}\dots\sum_{s_T}^{H}}_{N^T\text{ terms}}\pi(s_1)\prod_{t=1}^{T-1}a_{s_t s_{t+1}}\prod_{t=1}^{T}b_{s_t\to o_t} \tag{4}$$

with time complexity $O(TN^T)$: summing over all $N^T$ hidden sequences is intractable, which motivates the forward and backward algorithms below.
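As a sanity check, equation (4) can be transcribed into code almost literally. This brute-force sketch enumerates all $N^T$ hidden sequences and is only usable for tiny $T$; `pi`, `A`, `B` are NumPy arrays with `A[i, j]` $=a_{ij}$ and `B[i, k]` the probability that state $h_i$ emits symbol $k$ (a layout chosen here for illustration).

```python
import itertools
import numpy as np

def evaluate_brute_force(pi, A, B, obs):
    """Naive evaluation of P(O|lambda) by summing over all N^T hidden
    sequences, exactly as in equation (4). Complexity O(T * N^T)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for S in itertools.product(range(N), repeat=T):   # all N^T hidden sequences
        p = pi[S[0]] * B[S[0], obs[0]]                # pi(s1) * b_{s1 -> o1}
        for t in range(1, T):
            p *= A[S[t-1], S[t]] * B[S[t], obs[t]]    # a_{s_{t-1} s_t} * b_{s_t -> o_t}
        total += p
    return total
```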
Forward probability: given the hidden Markov model $\lambda$, define the forward probability as the probability of the partial observation sequence $o_1, o_2, \dots, o_t$ up to time $t$ with state $h_i$ at time $t$:

$$\alpha_t(i)=P(o_1\dots o_t,s_t=h_i|\lambda) \tag{5}$$

The forward probabilities $\alpha_t(i)$, and hence the observation probability $P(O|\lambda)$, can then be computed recursively. Initialization and final value:

$$\alpha_1(i)=P(o_1,s_1=h_i|\lambda)=P(o_1|s_1=h_i)P(s_1=h_i)=b_{h_i\to o_1}\,\pi(s_1=h_i),\qquad \alpha_T(i)=P(O,s_T=h_i|\lambda) \tag{6}$$
$$P(O|\lambda)=\sum_{i=1}^{N}P(O,s_T=h_i|\lambda)=\sum_{i=1}^{N}\alpha_T(i) \tag{7}$$

Expanding the recursion:

$$\begin{aligned}
\alpha_{t+1}(j)&=P(o_1\dots o_t o_{t+1},s_{t+1}=h_j|\lambda)\\
&=\sum_{i=1}^{N}P(o_1\dots o_t o_{t+1},s_t=h_i,s_{t+1}=h_j|\lambda)\\
&=\sum_{i=1}^{N}P(o_{t+1}|o_1\dots o_t,s_t=h_i,s_{t+1}=h_j;\lambda)\,P(o_1\dots o_t,s_t=h_i,s_{t+1}=h_j;\lambda)\\
&=\sum_{i=1}^{N}P(o_{t+1}|s_{t+1}=h_j)\,P(o_1\dots o_t,s_t=h_i,s_{t+1}=h_j;\lambda)\\
&=\sum_{i=1}^{N}P(o_{t+1}|s_{t+1}=h_j)\,P(s_{t+1}=h_j|s_t=h_i;\lambda)\,P(o_1\dots o_t,s_t=h_i;\lambda)\\
&=\sum_{i=1}^{N}b_{h_j\to o_{t+1}}\,a_{ij}\,\alpha_t(i)
\end{aligned} \tag{8}$$
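The recursion (6)-(8) translates directly into the forward algorithm, reducing the cost from $O(TN^T)$ to $O(TN^2)$. A minimal NumPy sketch, with the same array layout as the brute-force version above:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm following equations (6)-(8); alpha[t, i] holds
    alpha_{t+1}(i) from the text (arrays are 0-indexed).
    Returns alpha and P(O|lambda)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]              # alpha_1(i) = pi(h_i) b_{h_i -> o_1}
    for t in range(1, T):
        # alpha_{t+1}(j) = b_{h_j -> o_{t+1}} * sum_i a_{ij} alpha_t(i)
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    return alpha, alpha[-1].sum()             # P(O|lambda) = sum_i alpha_T(i), eq. (7)
```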
The backward algorithm solves the same problem from the other end: given $\lambda=(\pi,A,B)$, compute $P(O|\lambda)$.
Define

$$\beta_t(i)=P(o_{t+1}\dots o_T|s_t=h_i;\lambda) \tag{9}$$

so that

$$\beta_1(i)=P(o_2\dots o_T|s_1=h_i;\lambda) \tag{10}$$

At the same time,

$$\begin{aligned}
\beta_{T-1}(i)&=P(o_T|s_{T-1}=h_i;\lambda)\\
&=\sum_{j=1}^{N}P(o_T,s_T=h_j|s_{T-1}=h_i;\lambda)\\
&=\sum_{j=1}^{N}P(o_T|s_T=h_j,s_{T-1}=h_i;\lambda)\,P(s_T=h_j|s_{T-1}=h_i;\lambda)\\
&=\sum_{j=1}^{N}P(o_T|s_T=h_j,s_{T-1}=h_i;\lambda)\,a_{ij}\\
&=\sum_{j=1}^{N}b_{j\to o_T}\,a_{ij}
\end{aligned} \tag{11}$$

Now the recursion relating $\beta_t$ to $\beta_{t+1}$:

$$\begin{aligned}
\beta_t(i)&=P(o_{t+1}\dots o_T|s_t=h_i;\lambda)\\
&=\sum_{j=1}^{N}P(o_{t+1}\dots o_T,s_{t+1}=h_j|s_t=h_i;\lambda)\\
&=\sum_{j=1}^{N}P(o_{t+1}|o_{t+2}\dots o_T,s_{t+1}=h_j,s_t=h_i;\lambda)\,P(o_{t+2}\dots o_T,s_{t+1}=h_j|s_t=h_i;\lambda)\\
&=\sum_{j=1}^{N}b_{j\to o_{t+1}}\,P(o_{t+2}\dots o_T|s_{t+1}=h_j,s_t=h_i)\,P(s_{t+1}=h_j|s_t=h_i)\\
&=\sum_{j=1}^{N}b_{j\to o_{t+1}}\,\beta_{t+1}(j)\,a_{ij}
\end{aligned} \tag{12}$$

(In the second-to-last step, with two states in the conditioning we cannot simply drop $s_t$; but in the chain $a \to b \to c$, knowing $b$ cuts the link between $a$ and $c$, making them conditionally independent, so $s_t$ can indeed be removed.)

What we want is

$$\begin{aligned}
P(O|\lambda)&=P(o_1\dots o_T|\lambda)\\
&=\sum_{i=1}^{N}P(o_1\dots o_T,s_1=h_i;\lambda)\\
&=\sum_{i=1}^{N}P(o_1\dots o_T|s_1=h_i;\lambda)\,P(s_1=h_i)\\
&=\sum_{i=1}^{N}P(o_1|o_2\dots o_T,s_1=h_i;\lambda)\,P(o_2\dots o_T|s_1=h_i;\lambda)\,\pi(s_1=h_i)\\
&=\sum_{i=1}^{N}P(o_1|s_1=h_i)\,\beta_1(i)\,\pi(s_1=h_i)\\
&=\sum_{i=1}^{N}b_{h_i\to o_1}\,\beta_1(i)\,\pi(s_1=h_i)
\end{aligned} \tag{13}$$
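Equations (9)-(13) likewise give the backward algorithm. The sketch below mirrors `forward` above and should return the same $P(O|\lambda)$; it takes $\beta_T(i)=1$, the standard convention under which the recursion (12) reproduces (11).

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm following equations (9)-(13); beta[t, i] holds
    beta_{t+1}(i) from the text (arrays are 0-indexed).
    Returns beta and P(O|lambda)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{ij} b_{h_j -> o_{t+1}} beta_{t+1}(j)   (eq. 12)
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    # P(O|lambda) = sum_i pi(h_i) b_{h_i -> o_1} beta_1(i)            (eq. 13)
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()
```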
The EM algorithm:

$$\theta^{t+1}=\underset{\theta}{\operatorname{argmax}}\int_{z}\log P(X,Z|\theta)\,P(Z|X,\theta^t)\,dz \tag{14}$$

Mapped onto the HMM parameters $\lambda=(\pi,A,B)$:

$$\lambda^{t+1}=\underset{\lambda}{\operatorname{argmax}}\sum_{S}\log P(O,S|\lambda)\,P(S|O,\lambda^t) \tag{15}$$

Moreover, in $P(S|O,\lambda^t)=\frac{P(S,O|\lambda^t)}{P(O|\lambda^t)}$ the denominator $P(O|\lambda^t)$ is a constant that does not affect the argmax in (15), so the objective becomes

$$\lambda^{t+1}=\underset{\lambda}{\operatorname{argmax}}\sum_{S}\log P(O,S|\lambda)\,P(S,O|\lambda^t) \tag{16}$$

We may therefore define the objective function as

$$f(\lambda,\lambda^t)=\sum_{S}\log P(O,S|\lambda)\,P(S,O|\lambda^t)$$

Substituting (4), $P(O,S|\lambda)=\pi(s_1)\prod_{t=1}^{T-1}a_{s_t s_{t+1}}\prod_{t=1}^{T}b_{s_t\to o_t}$, gives

$$f(\lambda,\lambda^t)=\sum_{S}\Big[\Big(\log\pi(s_1)+\sum_{t=1}^{T-1}\log a_{s_t s_{t+1}}+\sum_{t=1}^{T}\log b_{s_t\to o_t}\Big)P(S,O|\lambda^t)\Big] \tag{17}$$

To simplify the computation, consider only the $\pi(s_1)$ term first (the $a$ and $b$ sums are handled separately in the same way), so (17) reduces to

$$\sum_{S}\log\pi(s_1)\,P(S,O|\lambda^t)=\sum_{s_1}\dots\sum_{s_T}\log\pi(s_1)\,P(O,s_1\dots s_T|\lambda^t)$$

Since $s_2\dots s_T$ have nothing to do with $\pi$, the sums $\sum_{s_2\dots s_T}$ simply marginalize $P(O,S)$: $P(O,s_1)=\sum_{s_2\dots s_T}P(O,s_1\dots s_T)$. Hence

$$=\sum_{s_1}\log\pi(s_1)\,P(O,s_1|\lambda^t)=\sum_{i=1}^{N}\log\pi(s_1=h_i)\,P(O,s_1=h_i|\lambda^t)$$

The problem becomes a constrained maximization:

$$\begin{cases}\max\;\sum_{i=1}^{N}\log\pi(s_1=h_i)\,P(O,s_1=h_i|\lambda^t)\\ \text{s.t.}\;\;\sum_{i=1}^{N}\pi(s_1=h_i)=1\end{cases} \tag{18}$$

Using the method of Lagrange multipliers:

$$L(\pi,\eta)=\sum_{i=1}^{N}\log\pi(s_1=h_i)\,P(O,s_1=h_i|\lambda^t)+\eta\Big(\sum_{i=1}^{N}\pi(s_1=h_i)-1\Big) \tag{19}$$
$$\frac{\partial L}{\partial \pi_i}=\frac{1}{\pi_i}P(O,s_1=h_i|\lambda^t)+\eta=0 \tag{20}$$
Multiplying both sides by $\pi_i$ and summing over all $i$ (using $\sum_i\pi_i=1$) gives

$$\sum_{i=1}^{N}P(O,s_1=h_i|\lambda^t)+\eta=0 \quad\Rightarrow\quad \eta=-\sum_{i=1}^{N}P(O,s_1=h_i|\lambda^t)$$

Substituting back into (20):

$$\pi_i=\frac{P(O,s_1=h_i|\lambda^t)}{\sum_{i=1}^{N}P(O,s_1=h_i|\lambda^t)}$$

and finally

$$\pi_i^{t+1}=\frac{P(O,s_1=h_i|\lambda^t)}{P(O|\lambda^t)}$$

Note: to actually evaluate these quantities one still uses the forward-backward algorithm, only with the first hidden state constrained.
The update for $a_{ij}$ follows the same pattern (switching to the common textbook notation in which $I=i_1 i_2\dots i_T$ is the hidden sequence and $\lambda^-$ the current parameter estimate):

$$\sum_{I}\Big(\sum_{t=1}^{T-1}\log a_{i_t i_{t+1}}\Big)p(O,I|\lambda^-)=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\log a_{ij}\;p(O,i_t=i,i_{t+1}=j|\lambda^-)$$

With the constraint $\sum_{j=1}^{N}a_{ij}=1$, Lagrange multipliers give

$$a_{ij}=\frac{\sum_{t=1}^{T-1}p(O,i_t=i,i_{t+1}=j|\lambda^-)}{\sum_{t=1}^{T-1}p(O,i_t=i|\lambda^-)}$$
Similarly, for the emission probabilities:

$$b_j(k)=\frac{\sum_{t=1}^{T}p(O,i_t=j|\lambda^-)\,I(o_t=v_k)}{\sum_{t=1}^{T}p(O,i_t=j|\lambda^-)}$$
For the emission matrix $b$, each hidden state has $K$ possible output values; when differentiating with respect to $b_j(k)$, only the terms with $o_t=v_k$ are nonzero, which is why the indicator $I(o_t=v_k)$ appears in the numerator.
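Putting the three updates together ($\pi_i^{t+1}$ from above, $a_{ij}$, and $b_j(k)$) gives one EM (Baum-Welch) re-estimation step. A minimal sketch for a single observation sequence, reusing the `forward` and `backward` functions sketched earlier; the intermediate posteriors `gamma` and `xi` are an implementation choice here, not notation from the text.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM re-estimation step for lambda = (pi, A, B) on one sequence."""
    N, T = len(pi), len(obs)
    obs = np.asarray(obs)
    alpha, p_obs = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)

    # gamma[t, i] = P(s_t = h_i | O, lambda) = alpha_t(i) beta_t(i) / P(O|lambda)
    gamma = alpha * beta / p_obs
    # xi[t, i, j] = P(s_t = h_i, s_{t+1} = h_j | O, lambda)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs

    new_pi = gamma[0]                                         # pi_i^{t+1}
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # a_ij update
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                               # b_j(k) update
        new_B[:, k] = gamma[obs == k].sum(axis=0)             # only t with o_t = v_k
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```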
The EM algorithm for HMMs is unsupervised. If annotations (the hidden state chain) are available, the parameters can be obtained directly from frequency counts, as in the sketch below.
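A sketch of that supervised case, assuming a single annotated sequence of hidden-state indices and observation indices (with several sequences one would pool the counts over all of them; states or symbols never seen would need smoothing):

```python
import numpy as np

def fit_supervised(hidden, obs, N, K):
    """MLE of lambda = (pi, A, B) by frequency counting when the hidden
    state chain is annotated -- no EM needed."""
    pi = np.zeros(N)
    A = np.zeros((N, N))
    B = np.zeros((N, K))
    pi[hidden[0]] += 1                      # with many sequences: count all first states
    for t in range(len(hidden)):
        B[hidden[t], obs[t]] += 1           # emission counts
        if t > 0:
            A[hidden[t-1], hidden[t]] += 1  # transition counts
    # normalize the counts into probability distributions
    return (pi / pi.sum(),
            A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))
```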
The Viterbi algorithm
In the spirit of dynamic programming, it computes the maximum over each prefix of the sequence and extends step by step to the maximum over the whole sequence. Compared with brute-force enumeration this greatly improves the time complexity.
For an observation sequence $O=o_1 o_2\dots o_T$ that has already occurred, we want the hidden sequence $s_1 s_2\dots s_T$, $s_i\in H$, that maximizes the probability of what happened.
Brute force: each $s_i$ can take $N$ values, giving $N^T$ candidate sequences; use the parameters to compute, for each one, the probability of producing the observed facts, and take the maximum. The Viterbi algorithm instead sweeps left to right, keeping at each time step, for every possible current state, only the best path ending in that state; this reduces the cost from exponential to $O(TN^2)$.

$$P(S|O,\lambda),\qquad S=s_1 s_2 s_3\dots s_T,\qquad O=o_1 o_2 o_3\dots o_T \tag{21}$$

First define

$$\delta_t(i)=\max_{s_1 s_2\dots s_{t-1}}P(s_t=h_i,s_{t-1}\dots s_2 s_1,o_1 o_2\dots o_t|\lambda)$$

the probability of the best partial path that ends in state $s_t=h_i$ at time $t$ while matching the facts observed so far.
Then

$$\begin{aligned}
\delta_{t+1}(i)&=\max_{s_1\dots s_t}p(s_{t+1}=h_i,s_1\dots s_t,o_1\dots o_{t+1})\\
&=\max_{s_1\dots s_t}p(o_{t+1}|s_1\dots s_t,s_{t+1}=h_i,o_1\dots o_t)\,p(s_1\dots s_t,s_{t+1}=h_i,o_1\dots o_t)\\
&=\max_{s_1\dots s_t}p(o_{t+1}|s_{t+1}=h_i)\,p(s_{t+1}=h_i|s_t,s_1\dots s_{t-1},o_1\dots o_t)\,p(s_1\dots s_t,o_1\dots o_t)\\
&=\max_{s_1\dots s_t}p(o_{t+1}|s_{t+1}=h_i)\,p(s_{t+1}=h_i|s_t)\,p(s_1\dots s_t,o_1\dots o_t)\\
&=\max_{1\le j\le N}\delta_t(j)\,a_{ji}\,b_{h_i\to o_{t+1}}
\end{aligned}$$
Initialization: $\delta_1(i)=P(s_1=h_i,o_1)=\pi(s_1=h_i)\,b_{h_i\to o_1}$. To recover the path, record a backpointer at every step, $\psi_{t+1}(i)=\underset{1\le j\le N}{\operatorname{argmax}}\,\delta_t(j)\,a_{ji}$; at the end take $s_T^*=\underset{1\le i\le N}{\operatorname{argmax}}\,\delta_T(i)$ and backtrack $s_t^*=\psi_{t+1}(s_{t+1}^*)$, which yields the index sequence of $S^*=\underset{S}{\operatorname{argmax}}\,P(S|O,\lambda)$.
Computing $\delta$ step by step for $t=1,2,\dots,T$ and then backtracking gives the decoded hidden state sequence; this method is the Viterbi algorithm.
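A minimal NumPy sketch of the whole procedure, with the backpointer array `psi` implementing the backtracking described above (same array layout as the earlier sketches):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: delta[t, i] holds delta_{t+1}(i) from the text,
    psi holds backpointers. Returns the most probable hidden state index
    sequence and its probability. Complexity O(T N^2)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # delta_1(i) = pi(h_i) b_{h_i -> o_1}
    for t in range(1, T):
        cand = delta[t-1][:, None] * A              # cand[j, i] = delta_t(j) a_{ji}
        psi[t] = cand.argmax(axis=0)                # best predecessor of each state i
        delta[t] = cand.max(axis=0) * B[:, obs[t]]  # times b_{h_i -> o_{t+1}}
    path = np.zeros(T, dtype=int)                   # backtrack from the best final state
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t+1][path[t+1]]
    return path, delta[-1].max()
```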