GMM and the EM Algorithm

In the previous post (GMM導入 - 情報関連の備忘録) we covered the form of the GMM (Gaussian Mixture Model); this time we look at how to estimate its parameters. We optimize the parameters with the EM algorithm.

First, let's start with the EM algorithm in its general form.

EM algorithm

Consider observed variables  {\bf X} and latent variables  {\bf Z}.  {\bf X} is composed of the vectors  {\bf x}_n, where  n = 1, \dots, N indexes the data points. We assume that  {\bf X} is generated from a distribution  p ({\bf X} | {\bf \theta}), where  {\bf \theta} is the set of model parameters.

From Jensen's inequality, the log likelihood can be rewritten as follows:

\begin{eqnarray} \log p ({\bf X} | {\bf \theta}) &=& \log \int p({\bf X}, {\bf Z} | {\bf \theta}) d{\bf Z}\\ &=& \log \int q({\bf Z}) \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z}\\ &\geq& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} \end{eqnarray}

The equality holds when  q^{*}({\bf Z}) = p({\bf Z} | {\bf X};{\bf \theta}). The last term  \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} is a lower bound on the log likelihood. Here, the lower bound is defined as  {\it L}({\bf X}|{\bf \theta}) :=  \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z}|{\bf \theta})}{q({\bf Z})}d{\bf Z} .

  • In the E step, we evaluate  q^{*}({\bf Z}) = p({\bf Z} | {\bf X};{\bf \theta}); that is, we compute the posterior of the latent variables  {\bf Z}.
  • In the M step, we substitute  q^{*}({\bf Z}), obtained with  {\bf \theta} fixed, into the lower bound, and then estimate the parameters  {\bf \theta} that maximize it.
  • We evaluate the log likelihood  \log p ({\bf X} | {\bf \theta}) and keep repeating the two steps until the log likelihood converges (a minimal sketch of this check follows after this list).
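For the GMM case discussed below, the convergence check amounts to evaluating  \log p({\bf X}|{\bf \theta}) = \sum_n \log \sum_k \pi_k {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1}). Here is a minimal sketch in Python (NumPy/SciPy assumed); the function name gmm_log_likelihood and the argument layout are my own choices, not anything fixed by the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Lambda):
    """log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Lambda_k^{-1})."""
    N, _ = X.shape
    K = len(pi)
    dens = np.empty((N, K))
    for k in range(K):
        # component density with covariance Sigma_k = Lambda_k^{-1}
        dens[:, k] = multivariate_normal.pdf(X, mean=mu[k], cov=np.linalg.inv(Lambda[k]))
    return np.sum(np.log(dens @ pi))
```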

EM algorithm for GMM

Let's look at the concrete steps for the GMM.

E step of GMM

In the E step of the Gaussian mixture, from the discussion above, \begin{eqnarray} q^{*}({\bf Z}) &=& p({\bf Z} | {\bf X};{\bf \theta})\\ &=& \prod_{n} p({\bf z}_n | {\bf x}_n ; {\bf \theta})\\ &=& \prod_{n} \prod_k \gamma_{nk}^{z_{nk}} \end{eqnarray} where the n-th row of  {\bf X} is  {\bf x}_n^T and the n-th row of  {\bf Z} is  {\bf z}_n^T.

Thus we only need to compute the responsibilities, that is, the matrix  \gamma_{nk} = \frac{\pi_k {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})}{\sum_{j} \pi_j {\it N}({\bf x}_n|{\bf \mu}_j, {\bf \Lambda}_j^{-1})}.
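A minimal sketch of this E step in the same Python style (the function name e_step and the argument layout are my own, hypothetical choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Lambda):
    """gamma[n, k] = pi_k N(x_n | mu_k, Lambda_k^{-1}) / sum_j pi_j N(x_n | mu_j, Lambda_j^{-1})."""
    N, _ = X.shape
    K = len(pi)
    weighted = np.empty((N, K))
    for k in range(K):
        weighted[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=np.linalg.inv(Lambda[k]))
    # normalize over components so each row sums to one
    return weighted / weighted.sum(axis=1, keepdims=True)
```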

M step of GMM

Next we maximize the lower bound with respect to the parameters  {\bf\theta}. From the above  q^{*}({\bf Z}), the lower bound  {\it L}({\bf X}|{\bf \theta}) is given by \begin{eqnarray} {\it L}({\bf X}|{\bf \theta}) &=& \sum_{\bf Z} q^{*}({\bf Z}) \log p({\bf X}, {\bf Z};{\bf \theta}) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z}) \\ &=& \sum_{\bf Z} q^{*}({\bf Z}) \sum_n \log p({\bf x}_n, {\bf z}_n; {\bf \theta}) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z})\\ &=& \sum_{\bf Z} q^{*}({\bf Z}) \sum_n \sum_k z_{nk} \left(\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})\right) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z})\\ &=& \sum_n \sum_k \sum_{\bf Z} q^{*}({\bf Z}) z_{nk} \left(\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})\right) - \sum_{\bf Z} q^*({\bf Z}) \log q^{*}({\bf Z}) \end{eqnarray}

First, we calculate  \sum_{{\bf Z}} q^{*}({\bf Z}) z_{nk}. \begin{eqnarray} \sum_{{\bf Z}} q^{*}({\bf Z}) z_{nk} &=& \sum_{{\bf Z}} \prod_{n'} \prod_{k'} \gamma_{n'k'}^{z_{n'k'}} z_{nk} \\ &=& \gamma_{nk} \end{eqnarray}

Then, focusing on the part of  {\it L}({\bf X}|{\bf \theta}) that depends on  {\bf \theta}, \begin{eqnarray} {\it L}({\bf X}|{\bf \theta}) &=& \sum_n \sum_k \gamma_{nk} \left(\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})\right) + {\rm const}\\ &=& \sum_n \sum_k \gamma_{nk} \left(\log \pi_k - \frac{1}{2} ({\bf x}_n - {\bf\mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf\mu}_k) +\frac{1}{2} \log |{\bf\Lambda}_k| \right)+ {\rm const} \end{eqnarray}

Estimate new  \pi

Because of the constraint  \sum_k \pi_k = 1 on  {\bf \pi}, we maximize the lower bound with respect to  {\bf\pi} using a Lagrange multiplier, as follows. \begin{eqnarray} {\tilde {\it L}}({\bf X}|{\bf \theta}) &=& {\it L}({\bf X}|{\bf \theta}) + \lambda (\sum_k \pi_k -1)\\ \frac{\partial {\tilde {\it L}}({\bf X}|{\bf \theta})}{\partial \pi_k} &=& 0 \quad({\rm for \, all}\, k)\\ \frac{\partial {\tilde {\it L}}({\bf X}|{\bf \theta})}{\partial \lambda} &=& 0 \end{eqnarray}

Then we compute the above equations. \begin{eqnarray} \frac{\partial}{\partial \pi_k}\left[\sum_n \sum_{k'} \gamma_{nk'} \log \pi_{k'} + \lambda \left(\sum_{k'} \pi_{k'} - 1\right)\right] &=& 0 \to \pi_k = -\frac{\sum_n \gamma_{nk}}{\lambda}\\ \frac{\partial}{\partial \lambda} \lambda \left(\sum_k \pi_k -1\right) &=& 0 \to \sum_k \pi_k -1 = 0 \end{eqnarray} On substituting the former into the latter,
\begin{eqnarray} \sum_k \pi_k &=& -\sum_k \frac{\sum_n \gamma_{nk}}{\lambda} = 1 \to \lambda = -\sum_k \sum_n \gamma_{nk} \end{eqnarray} Therefore, the new  \pi_k is given by \begin{eqnarray} \pi_k^{*} = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}} \end{eqnarray}
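As a sketch in the same minimal Python style, the  \pi update from the responsibilities looks like the following; note that  \sum_k \sum_n \gamma_{nk} equals the number of data points N, since each row of  \gamma sums to one.

```python
import numpy as np

def update_pi(gamma):
    """pi_k = sum_n gamma_nk / sum_k sum_n gamma_nk (the denominator equals N)."""
    Nk = gamma.sum(axis=0)     # effective number of points assigned to each component
    return Nk / Nk.sum()       # Nk.sum() == N
```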

Estimate new  {\bf \mu}

We maximize the lower bound with respect to  {\bf\mu}: \begin{eqnarray} \frac{\partial {\it L}({\bf X}|{\bf \theta})}{\partial {\bf \mu}_k} &=& 0 \end{eqnarray} Computing this gives \begin{eqnarray} \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \mu}_k}\frac{1}{2} ({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf \mu}_k) = 0 \to \sum_n \gamma_{nk} ({\bf x}_n - {\bf \mu}_k) = 0 \end{eqnarray} Therefore, the new  {\bf \mu}_k is given by \begin{eqnarray} {\bf \mu}_k^{*} &=& \frac{\sum_n \gamma_{nk}{\bf x}_n}{\sum_n \gamma_{nk}} \end{eqnarray}
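A corresponding sketch of the  {\bf \mu} update (again my own minimal Python, not a fixed API):

```python
import numpy as np

def update_mu(X, gamma):
    """mu_k = sum_n gamma_nk x_n / sum_n gamma_nk, returned as a (K, D) array."""
    Nk = gamma.sum(axis=0)                  # (K,)
    return (gamma.T @ X) / Nk[:, None]      # responsibility-weighted means, (K, D)
```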

Estimate new  {\bf \Lambda}

We maximize the lower bound with respect to  {\bf \Lambda} by setting   \frac{\partial {\it L}({\bf X}|{\bf \theta})}{\partial {\bf \Lambda}_k} = 0. We then compute this as follows: \begin{eqnarray} \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \Lambda}_k} \left[-\frac{1}{2}({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf \mu}_k)+\frac{1}{2} \log |{\bf\Lambda}_k| \right]&=& 0\\ \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \Lambda}_k} \left[-{\rm tr}\left[({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k\right]+\log |{\bf\Lambda}_k| \right]&=& 0 \\ \sum_n \gamma_{nk}\left[({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T\right]^T - \sum_n \gamma_{nk} {\bf \Lambda}_k^{-T} &=& 0 \end{eqnarray} Therefore, the new  {\bf \Lambda}_k is given by \begin{eqnarray} {{\bf \Lambda}^{*}_k}^{-1} &=& \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T \end{eqnarray} After  {\bf \mu}_k has been updated to  {\bf \mu}_k = \frac{\sum_n \gamma_{nk}{\bf x}_n}{\sum_n \gamma_{nk}}, this can be rewritten as
\begin{eqnarray} {{\bf \Lambda}^{*}_k}^{-1} &=& \frac{\sum_n \gamma_{nk} {\bf x}_n {\bf x}_n^T }{\sum_n \gamma_{nk}} - {\bf \mu}_k {\bf \mu}_k^T \end{eqnarray}
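A sketch of the  {\bf \Lambda} update in the same minimal Python style (it computes the responsibility-weighted covariance and then inverts it to get the precision; all names are my own):

```python
import numpy as np

def update_Lambda(X, gamma, mu):
    """Lambda_k^{-1} = sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T / sum_n gamma_nk."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)
    Lambda = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                                        # (N, D)
        Sigma_k = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
        Lambda[k] = np.linalg.inv(Sigma_k)                      # precision = inverse covariance
    return Lambda
```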

Appendix: EM algorithm

Since we are at it, let's look at the general EM algorithm in a bit more detail, approaching it from a different direction. Using  \int q({\bf Z}) d{\bf Z} = 1 and  p({\bf X}|{\bf \theta}) = \frac{p({\bf X}, {\bf Z}|{\bf \theta})}{p({\bf Z}|{\bf X}, {\bf \theta})}, \begin{eqnarray} \log p ({\bf X} | {\bf \theta}) &=& \int q({\bf Z}) \log p ({\bf X} | {\bf \theta}) d{\bf Z}\\ &=& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{p({\bf Z} | {\bf X}, {\bf \theta})}d{\bf Z}\\ &=& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} - \int q({\bf Z}) \log \frac{p({\bf Z} | {\bf X}, {\bf \theta})}{q({\bf Z})}d{\bf Z}\\ &=& {\it L}(q, {\bf \theta}) + KL(q||p) \end{eqnarray}

The KL divergence  KL(q||p) satisfies  KL(q||p) \ge 0, with equality  KL(q||p) = 0 when  q({\bf Z}) = p({\bf Z} |{\bf X},  {\bf \theta}). In other words, with  {\bf \theta} fixed, maximizing the lower bound  {\it L}(q, {\bf \theta}) with respect to  q means setting  KL(q||p) = 0 \longleftrightarrow q^{*}({\bf Z}) = p({\bf Z} |{\bf X},  {\bf \theta}), which makes the bound equal to the log likelihood  \log p ({\bf X} | {\bf \theta}). This corresponds to the E step.

Next, given this distribution, so that  \log p ({\bf X} | {\bf \theta}) = {\it L}(q^{*}, {\bf \theta}), we maximize the bound with respect to the parameters  {\bf \theta} that were held fixed before. In most cases, differentiating with respect to  {\bf \theta} does the job. This corresponds to the M step.

By repeating this procedure, the parameters are updated in a direction that never decreases the likelihood. As a caveat, the EM steps are fairly heavy computationally and tend to get stuck in local maxima.
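To make the decomposition above concrete, here is a small numerical check I put together (a 1-D, two-component mixture with a handful of points; all names and values are purely illustrative). For discrete latent assignments the integral over  {\bf Z} becomes a sum over components, and  \log p({\bf X}|{\bf \theta}) should equal  {\it L}(q, {\bf \theta}) + KL(q||p) for any valid  q, with  KL = 0 exactly when  q is the posterior.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D, two-component GMM and a few data points
X = np.array([-2.0, -1.5, 0.3, 1.8, 2.2])
pi = np.array([0.4, 0.6])
mu = np.array([-1.5, 2.0])
sigma = np.array([0.8, 1.0])

# Component densities N(x_n | mu_k, sigma_k^2), shape (N, K)
dens = norm.pdf(X[:, None], loc=mu, scale=sigma)

log_lik = np.sum(np.log(dens @ pi))             # log p(X | theta)
gamma = (pi * dens) / (dens @ pi)[:, None]      # posterior p(z_n | x_n, theta)

# An arbitrary valid q(Z): rows sum to one (deliberately different from the posterior)
q = np.array([[0.7, 0.3]] * len(X))

L_bound = np.sum(q * (np.log(pi) + np.log(dens) - np.log(q)))   # L(q, theta)
KL = np.sum(q * (np.log(q) - np.log(gamma)))                    # KL(q || p(Z|X, theta))

print(log_lik, L_bound + KL)                  # the two numbers agree
print(np.allclose(log_lik, L_bound + KL))     # True
```

Setting q equal to gamma in this snippet drives KL to zero and makes the bound coincide with the log likelihood, which is exactly what the E step does.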

This post covered EM; the next post deals with the Bayesian version of the model: GMMと変分ベイズ - 情報関連の備忘録

References

Machine Learning | The MIT Press: well organized, and easy to read if you already have some background. A PDF can also be found online.