GMM and the EM Algorithm
In the previous post we looked at the form of the GMM (Gaussian Mixture Model) (GMM導入 - 情報関連の備忘録); this time we look at how to estimate its parameters. We optimize the parameters with the EM algorithm.
First, the general story of the EM algorithm.
EM algorithm
Let's consider observed variables ${\bf X}$ and latent variables ${\bf Z}$. ${\bf X}$ is composed of ${\bf x}_1, \dots, {\bf x}_N$, where $N$ is the number of data points. We then assume that ${\bf X}$ is generated from a distribution $p({\bf X} | {\bf \theta})$, where ${\bf \theta}$ is the set of model parameters.
From Jensen's inequality, the log likelihood can be rewritten as follows:
\begin{eqnarray} \log p ({\bf X} | {\bf \theta}) &=& \log \int p({\bf X}, {\bf Z} | {\bf \theta}) d{\bf Z}\\ &=& \log \int q({\bf Z}) \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z}\\ &\geq& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} \end{eqnarray}
The equality holds when $q({\bf Z}) = p({\bf Z} | {\bf X}, {\bf \theta})$. The last term is the lower bound, which we denote ${\it L}(q, {\bf \theta})$.
- In the E step, we maximize the lower bound with respect to $q({\bf Z})$ while ${\bf \theta}$ is fixed. Thus we calculate the posterior of the latent variables, $q^{*}({\bf Z}) = p({\bf Z} | {\bf X}, {\bf \theta})$.
- In the M step, we substitute $q^{*}({\bf Z})$ into the lower bound and keep it fixed. We then estimate the parameters ${\bf \theta}$ that maximize the lower bound.
- We evaluate the log likelihood and keep updating until it converges (see the sketch below).
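As a rough illustration of the loop, here is a minimal Python sketch; `em`, `e_step`, `m_step`, and `log_likelihood` are hypothetical placeholder names for the model-specific computations derived below, not anything defined in this post.

```python
import numpy as np

def em(X, params, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the log likelihood converges."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        q = e_step(X, params)           # E step: posterior over the latent variables
        params = m_step(X, q)           # M step: maximize the lower bound w.r.t. the parameters
        ll = log_likelihood(X, params)  # monitor convergence
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return params
```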
EM algorithm for GMM
Let's look at the concrete steps for the GMM.
E step of GMM
In the E step of the Gaussian mixture, from the above discussion \begin{eqnarray} q^{*}({\bf Z}) &=& p({\bf Z} | {\bf X};{\bf \theta})\\ &=& \prod_{n} p({\bf z}_n | {\bf x}_n ; {\bf \theta})\\ &=& \prod_{n} \prod_k \gamma_{nk}^{z_{nk}} \end{eqnarray} where the n-th row of ${\bf X}$ is ${\bf x}_n$ and the n-th row of ${\bf Z}$ is ${\bf z}_n$.
Thus we should compute the responsibility
\begin{eqnarray} \gamma_{nk} = p(z_{nk}=1 | {\bf x}_n; {\bf \theta}) = \frac{\pi_k {\it N}({\bf x}_n | {\bf \mu}_k, {\bf \Lambda}_k^{-1})}{\sum_{k'} \pi_{k'} {\it N}({\bf x}_n | {\bf \mu}_{k'}, {\bf \Lambda}_{k'}^{-1})} \end{eqnarray}
that is, the $N \times K$ matrix $(\gamma_{nk})$.
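As a concrete illustration, here is a minimal numpy sketch of this E step. The array names (`X` of shape (N, D), mixing weights `pi` of shape (K,), means `mu` of shape (K, D), precision matrices `Lam` of shape (K, D, D)) are my own choices and not from the derivation above.

```python
import numpy as np

def gaussian_log_pdf(X, mu_k, Lam_k):
    """log N(x_n | mu_k, Lam_k^{-1}) for every row of X, parameterized by the precision Lam_k."""
    D = X.shape[1]
    diff = X - mu_k                                      # (N, D)
    quad = np.einsum('ni,ij,nj->n', diff, Lam_k, diff)   # (x_n - mu_k)^T Lam_k (x_n - mu_k)
    _, logdet = np.linalg.slogdet(Lam_k)
    return 0.5 * (logdet - D * np.log(2 * np.pi) - quad)

def e_step(X, pi, mu, Lam):
    """Responsibilities gamma_nk = pi_k N(x_n|mu_k, Lam_k^{-1}) / sum_k' pi_k' N(x_n|mu_k', Lam_k'^{-1})."""
    K = len(pi)
    log_r = np.stack([np.log(pi[k]) + gaussian_log_pdf(X, mu[k], Lam[k])
                      for k in range(K)], axis=1)        # (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)            # subtract the max for numerical stability
    gamma = np.exp(log_r)
    return gamma / gamma.sum(axis=1, keepdims=True)      # normalize over k
```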
M step of GMM
Next we maximize the lower bound with respect to the parameters ${\bf \theta} = \{\pi_k, {\bf \mu}_k, {\bf \Lambda}_k\}$. From the above $q^{*}({\bf Z})$, the lower bound is given by \begin{eqnarray} {\it L}({\bf X}|{\bf \theta}) &=& \sum_{\bf Z} q^{*}({\bf Z}) \log p({\bf X}, {\bf Z};{\bf \theta}) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z}) \\ &=& \sum_{\bf Z} q^{*}({\bf Z}) \sum_n \log p({\bf x}_n, {\bf z}_n; {\bf \theta}) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z})\\ &=& \sum_{\bf Z} q^{*}({\bf Z}) \sum_n \sum_k z_{nk} (\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z})\\ &=& \sum_n \sum_k \sum_{\bf Z} q^{*}({\bf Z}) z_{nk} (\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})) - \sum_{\bf Z} q^*({\bf Z}) \log q^{*}({\bf Z}) \end{eqnarray}
First, we calculate $\sum_{\bf Z} q^{*}({\bf Z}) z_{nk}$. \begin{eqnarray} \sum_{{\bf Z}} q^{*}({\bf Z}) z_{nk} &=& \sum_{{\bf Z}} \prod_{n'} \prod_{k'} \gamma_{n'k'}^{z_{n'k'}} z_{nk} \\ &=& \gamma_{nk} \end{eqnarray}
Then, paying attention to the part of ${\it L}({\bf X}|{\bf \theta})$ that depends on ${\bf \theta}$, \begin{eqnarray} {\it L}({\bf X}|{\bf \theta}) &=& \sum_n \sum_k \gamma_{nk} (\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})) + {\rm const}\\ &=& \sum_n \sum_k \gamma_{nk} (\log \pi_k - \frac{1}{2} ({\bf x}_n - {\bf\mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf\mu}_k) +\frac{1}{2} \log |{\bf\Lambda}_k| )+ {\rm const} \end{eqnarray}
Estimate new $\pi_k$
Under the constraint $\sum_k \pi_k = 1$, we maximize the lower bound with respect to $\pi_k$ using a Lagrange multiplier $\lambda$, as follows. \begin{eqnarray} {\tilde {\it L}}({\bf X}|{\bf \theta}) &=& {\it L}({\bf X}|{\bf \theta}) + \lambda (\sum_k \pi_k -1)\\ \frac{\partial {\tilde {\it L}}({\bf X}|{\bf \theta})}{\partial \pi_k} &=& 0 \quad({\rm for \, all}\, k)\\ \frac{\partial {\tilde {\it L}}({\bf X}|{\bf \theta})}{\partial \lambda} &=& 0 \end{eqnarray}
Then we compute the above equations.
\begin{eqnarray}
\frac{\partial}{\partial \pi_k}\left\{\sum_n \sum_{k'} \gamma_{nk'} \log \pi_{k'} + \lambda (\sum_{k'} \pi_{k'} - 1)\right\} &=& 0
\to \pi_k = -\frac{\sum_n \gamma_{nk}}{\lambda}\\
\frac{\partial}{\partial \lambda} \lambda (\sum_k \pi_k -1)&=& 0
\to \sum_k \pi_k -1 = 0
\end{eqnarray}
On substituting,
\begin{eqnarray}
\sum_k \pi_k &=& -\sum_k \frac{\sum_n \gamma_{nk}}{\lambda}
\to \lambda = -\sum_k \sum_n \gamma_{nk}
\end{eqnarray}
Therefore, the new $\pi_k$ is given by
\begin{eqnarray}
\pi_k^{*} = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}
\end{eqnarray}
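In numpy, with `gamma` the (N, K) responsibility matrix from the E-step sketch above (names of my choosing), this update is a one-liner; note that the denominator $\sum_k \sum_n \gamma_{nk}$ equals $N$.

```python
# pi_k^* = sum_n gamma_nk / sum_k sum_n gamma_nk   (the denominator equals N)
pi_new = gamma.sum(axis=0) / gamma.sum()
```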
Estimate new ${\bf \mu}_k$
We maximize the lower bound with respect to ${\bf \mu}_k$. \begin{eqnarray} \frac{\partial {\it L}({\bf X}|{\bf \theta})}{\partial {\bf \mu}_k} &=& 0 \end{eqnarray} Computing this gives \begin{eqnarray} \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \mu}_k}\frac{1}{2} ({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf \mu}_k) = 0 \to \sum_n \gamma_{nk} ({\bf x}_n - {\bf \mu}_k) = 0 \end{eqnarray} Therefore, the new ${\bf \mu}_k$ is given by \begin{eqnarray} {\bf \mu}_k^{*} &=& \frac{\sum_n \gamma_{nk}{\bf x}_n}{\sum_n \gamma_{nk}} \end{eqnarray}
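With the same hypothetical arrays as in the previous snippets, the mean update is a responsibility-weighted average of the data, computed for all components at once:

```python
# mu_k^* = sum_n gamma_nk x_n / sum_n gamma_nk
Nk = gamma.sum(axis=0)                  # (K,) effective number of points per component
mu_new = (gamma.T @ X) / Nk[:, None]    # (K, D)
```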
Estimate new ${\bf \Lambda}_k$
We maximize the lower bound with respect to ${\bf \Lambda}_k$ such that
\begin{eqnarray} \frac{\partial {\it L}({\bf X}|{\bf \theta})}{\partial {\bf \Lambda}_k} &=& 0 \end{eqnarray}
We then compute this as follows:
\begin{eqnarray}
\sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \Lambda}_k}
\left\{-\frac{1}{2}({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf \mu}_k)+\frac{1}{2} \log |{\bf\Lambda}_k| \right\}&=& 0\\
\sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \Lambda}_k}
\left\{-{\rm tr}[({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k]+\log |{\bf\Lambda}_k| \right\}&=& 0 \\
\sum_n \gamma_{nk}[({\bf x}_n - {\bf \mu}_k)({\bf x}_n
- {\bf \mu}_k)^T]^T - \sum_n \gamma_{nk} {\bf \Lambda}_k^{-1 \, T} &=& 0
\end{eqnarray}
Therefore, the new ${\bf \Lambda}_k$ is given by
\begin{eqnarray}
{{\bf \Lambda}^{*}_k}^{-1} &=& \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}({\bf x}_n - {\bf \mu}_k)({\bf x}_n
- {\bf \mu}_k)^T
\end{eqnarray}
After ${\bf \mu}_k$ is updated with ${\bf \mu}_k^{*}$,
this can be rewritten as
\begin{eqnarray}
{{\bf \Lambda}^{*}_k}^{-1} &=& \frac{\sum_n \gamma_{nk} {\bf x}_n {\bf x}_n^T }{\sum_n \gamma_{nk}} - {\bf \mu}_k {\bf \mu}_k^T
\end{eqnarray}
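A sketch of the corresponding covariance update (i.e. the inverse of ${\bf \Lambda}_k^{*}$), again using the hypothetical `X`, `gamma`, `Nk`, and `mu_new` from the snippets above:

```python
# Sigma_k = (Lam_k^*)^{-1} = (1/Nk) sum_n gamma_nk (x_n - mu_k^*)(x_n - mu_k^*)^T
K, D = mu_new.shape
Sigma_new = np.empty((K, D, D))
for k in range(K):
    diff = X - mu_new[k]                                        # uses the updated mean mu_k^*
    Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
Lam_new = np.linalg.inv(Sigma_new)                              # back to precision matrices
```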
Appendix: EM algorithm
Since we have come this far, let's consider the general EM algorithm in a little more detail, this time from a different direction. \begin{eqnarray} \log p ({\bf X} | {\bf \theta}) &=& \int q({\bf Z}) \log p({\bf X} | {\bf \theta}) d{\bf Z}\\ &=& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{p({\bf Z} | {\bf X}, {\bf \theta})} d{\bf Z}\\ &=& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} - \int q({\bf Z}) \log \frac{p({\bf Z} | {\bf X}, {\bf \theta})}{q({\bf Z})}d{\bf Z}\\ &=& {\it L}(q, {\bf \theta}) + KL(q\|p) \end{eqnarray}
The KL divergence satisfies $KL(q\|p) \geq 0$, and it becomes $0$ when $q({\bf Z}) = p({\bf Z} | {\bf X}, {\bf \theta})$. In other words, with ${\bf \theta}$ fixed, the lower bound ${\it L}(q, {\bf \theta})$ is maximized, and reaches the log likelihood, by setting $q({\bf Z}) = p({\bf Z} | {\bf X}, {\bf \theta})$. This corresponds to the E step.
Next, with this distribution $q$ fixed, we maximize the lower bound with respect to the parameters ${\bf \theta}$ that were held fixed before. In most cases, differentiating with respect to ${\bf \theta}$ and setting the derivative to zero does the job. This corresponds to the M step.
By repeating this procedure, the parameters are updated in a direction that never decreases the likelihood. Be aware, though, that the EM steps are computationally rather heavy and tend to get stuck in local maxima. A sketch that puts the whole loop together follows below.
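Putting the pieces together, a naive EM loop for a GMM could look like the sketch below. It reuses the `e_step` from the E-step snippet, and all names as well as the small diagonal regularization are my own choices, not something prescribed by the derivation. In practice one would run it from several random initializations and keep the result with the best log likelihood to mitigate the local-maximum problem.

```python
def fit_gmm(X, K, n_iter=100, seed=0):
    """Naive EM for a GMM with full covariances, parameterized by precision matrices."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()    # initialize means from random data points
    Lam = np.tile(np.linalg.inv(np.cov(X.T) + 1e-6 * np.eye(D)), (K, 1, 1))
    for _ in range(n_iter):
        gamma = e_step(X, pi, mu, Lam)                     # E step: responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N                                        # M step: mixing weights
        mu = (gamma.T @ X) / Nk[:, None]                   # M step: means
        for k in range(K):                                 # M step: covariances -> precisions
            diff = X - mu[k]
            Sigma_k = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Lam[k] = np.linalg.inv(Sigma_k + 1e-6 * np.eye(D))
    return pi, mu, Lam
```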
This post dealt with EM; the next one covers the Bayesian version of the model: GMMと変分ベイズ - 情報関連の備忘録
References
Machine Learning | The MIT Press: well organized, so it is easy to read if you already have some background. A PDF can also be found online.