GMM and the EM Algorithm

In the previous post (GMM導入 - 情報関連の備忘録) we covered the form of the GMM (Gaussian Mixture Model); this time we look at how to estimate its parameters. We optimize the parameters with the EM algorithm.

First, let's start with the EM algorithm in its general form.

EM algorithm

Consider observed variables  {\bf X} and latent variables  {\bf Z}.  {\bf X} is composed of the vectors  {\bf x}_n, where  n = 1, \dots, N indexes the data points. We assume that  {\bf X} is generated from a distribution  p ({\bf X} | {\bf \theta}), where  {\bf \theta} is the set of model parameters.

From Jensen's inequality, the log likelihood can be rewritten as follows:

\begin{eqnarray} \log p ({\bf X} | {\bf \theta}) &=& \log \int p({\bf X}, {\bf Z} | {\bf \theta}) d{\bf Z}\\ &=& \log \int q({\bf Z}) \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z}\\ &\geq& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} \end{eqnarray}

The equality holds when  q^{*}({\bf Z}) = p({\bf Z} | {\bf X};{\bf \theta}). The last term  \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} is a lower bound on the log likelihood. Here, the lower bound is defined as  {\it L}({\bf X}|{\bf \theta}) :=  \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z}|{\bf \theta})}{q({\bf Z})}d{\bf Z} .

  • In the E step, we evaluate  q^{*}({\bf Z}) = p({\bf Z} | {\bf X};{\bf \theta}); that is, we compute the posterior of the latent variables  {\bf Z}.
  • In the M step, we substitute  q^{*}({\bf Z}), obtained with  {\bf \theta} fixed, into the lower bound, and then estimate the parameters  {\bf \theta} that maximize it.
  • We evaluate the log likelihood  \log p ({\bf X} | {\bf \theta}) and keep repeating the two steps until the log likelihood converges (a minimal sketch of this check follows after this list).
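For the GMM case discussed below, the convergence check amounts to evaluating  \log p({\bf X}|{\bf \theta}) = \sum_n \log \sum_k \pi_k {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1}). Here is a minimal sketch in Python (NumPy/SciPy assumed); the function name gmm_log_likelihood and the argument layout are my own choices, not anything fixed by the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Lambda):
    """log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Lambda_k^{-1})."""
    N, _ = X.shape
    K = len(pi)
    dens = np.empty((N, K))
    for k in range(K):
        # component density with covariance Sigma_k = Lambda_k^{-1}
        dens[:, k] = multivariate_normal.pdf(X, mean=mu[k], cov=np.linalg.inv(Lambda[k]))
    return np.sum(np.log(dens @ pi))
```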

EM algorithm for GMM

Let's look at the concrete steps for the GMM.

E step of GMM

In the E step of the Gaussian mixture, from the discussion above, \begin{eqnarray} q^{*}({\bf Z}) &=& p({\bf Z} | {\bf X};{\bf \theta})\\ &=& \prod_{n} p({\bf z}_n | {\bf x}_n ; {\bf \theta})\\ &=& \prod_{n} \prod_k \gamma_{nk}^{z_{nk}} \end{eqnarray} where the n-th row of  {\bf X} is  {\bf x}_n^T and the n-th row of  {\bf Z} is  {\bf z}_n^T.

Thus we only need to compute the responsibilities, that is, the matrix  \gamma_{nk} = \frac{\pi_k {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})}{\sum_{j} \pi_j {\it N}({\bf x}_n|{\bf \mu}_j, {\bf \Lambda}_j^{-1})}.
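A minimal sketch of this E step in the same Python style (the function name e_step and the argument layout are my own, hypothetical choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Lambda):
    """gamma[n, k] = pi_k N(x_n | mu_k, Lambda_k^{-1}) / sum_j pi_j N(x_n | mu_j, Lambda_j^{-1})."""
    N, _ = X.shape
    K = len(pi)
    weighted = np.empty((N, K))
    for k in range(K):
        weighted[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=np.linalg.inv(Lambda[k]))
    # normalize over components so each row sums to one
    return weighted / weighted.sum(axis=1, keepdims=True)
```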

M step of GMM

Next we maximize the lower bound with respect to the parameters  {\bf\theta}. From the above  q^{*}({\bf Z}), the lower bound  {\it L}({\bf X}|{\bf \theta}) is given by \begin{eqnarray} {\it L}({\bf X}|{\bf \theta}) &=& \sum_{\bf Z} q^{*}({\bf Z}) \log p({\bf X}, {\bf Z};{\bf \theta}) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z}) \\ &=& \sum_{\bf Z} q^{*}({\bf Z}) \sum_n \log p({\bf x}_n, {\bf z}_n; {\bf \theta}) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z})\\ &=& \sum_{\bf Z} q^{*}({\bf Z}) \sum_n \sum_k z_{nk} \left(\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})\right) - \sum_{\bf Z} q^{*}({\bf Z}) \log q^{*}({\bf Z})\\ &=& \sum_n \sum_k \sum_{\bf Z} q^{*}({\bf Z}) z_{nk} \left(\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})\right) - \sum_{\bf Z} q^*({\bf Z}) \log q^{*}({\bf Z}) \end{eqnarray}

First, we calculate  \sum_{{\bf Z}} q^{*}({\bf Z}) z_{nk}. \begin{eqnarray} \sum_{{\bf Z}} q^{*}({\bf Z}) z_{nk} &=& \sum_{{\bf Z}} \prod_{n'} \prod_{k'} \gamma_{n'k'}^{z_{n'k'}} z_{nk} \\ &=& \gamma_{nk} \end{eqnarray}

Then, focusing on the part of  {\it L}({\bf X}|{\bf \theta}) that depends on  {\bf \theta}, \begin{eqnarray} {\it L}({\bf X}|{\bf \theta}) &=& \sum_n \sum_k \gamma_{nk} \left(\log \pi_k + \log {\it N}({\bf x}_n|{\bf \mu}_k, {\bf \Lambda}_k^{-1})\right) + {\rm const}\\ &=& \sum_n \sum_k \gamma_{nk} \left(\log \pi_k - \frac{1}{2} ({\bf x}_n - {\bf\mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf\mu}_k) +\frac{1}{2} \log |{\bf\Lambda}_k| \right)+ {\rm const} \end{eqnarray}

Estimate new  \pi

Because of the constraint  \sum_k \pi_k = 1 on  {\bf \pi}, we maximize the lower bound with respect to  {\bf\pi} using a Lagrange multiplier, as follows. \begin{eqnarray} {\tilde {\it L}}({\bf X}|{\bf \theta}) &=& {\it L}({\bf X}|{\bf \theta}) + \lambda (\sum_k \pi_k -1)\\ \frac{\partial {\tilde {\it L}}({\bf X}|{\bf \theta})}{\partial \pi_k} &=& 0 \quad({\rm for \, all}\, k)\\ \frac{\partial {\tilde {\it L}}({\bf X}|{\bf \theta})}{\partial \lambda} &=& 0 \end{eqnarray}

Then we compute the above equations. \begin{eqnarray} \frac{\partial}{\partial \pi_k}\left[\sum_n \sum_{k'} \gamma_{nk'} \log \pi_{k'} + \lambda \left(\sum_{k'} \pi_{k'} - 1\right)\right] &=& 0 \to \pi_k = -\frac{\sum_n \gamma_{nk}}{\lambda}\\ \frac{\partial}{\partial \lambda} \lambda \left(\sum_k \pi_k -1\right) &=& 0 \to \sum_k \pi_k -1 = 0 \end{eqnarray} On substituting the former into the latter,
\begin{eqnarray} \sum_k \pi_k &=& -\sum_k \frac{\sum_n \gamma_{nk}}{\lambda} = 1 \to \lambda = -\sum_k \sum_n \gamma_{nk} \end{eqnarray} Therefore, the new  \pi_k is given by \begin{eqnarray} \pi_k^{*} = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}} \end{eqnarray}
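As a sketch in the same minimal Python style, the  \pi update from the responsibilities looks like the following; note that  \sum_k \sum_n \gamma_{nk} equals the number of data points N, since each row of  \gamma sums to one.

```python
import numpy as np

def update_pi(gamma):
    """pi_k = sum_n gamma_nk / sum_k sum_n gamma_nk (the denominator equals N)."""
    Nk = gamma.sum(axis=0)     # effective number of points assigned to each component
    return Nk / Nk.sum()       # Nk.sum() == N
```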

Estimate new  {\bf \mu}

We maximize the lower bound with respect to  {\bf\mu}: \begin{eqnarray} \frac{\partial {\it L}({\bf X}|{\bf \theta})}{\partial {\bf \mu}_k} &=& 0 \end{eqnarray} Computing this gives \begin{eqnarray} \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \mu}_k}\frac{1}{2} ({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf \mu}_k) = 0 \to \sum_n \gamma_{nk} ({\bf x}_n - {\bf \mu}_k) = 0 \end{eqnarray} Therefore, the new  {\bf \mu}_k is given by \begin{eqnarray} {\bf \mu}_k^{*} &=& \frac{\sum_n \gamma_{nk}{\bf x}_n}{\sum_n \gamma_{nk}} \end{eqnarray}
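A corresponding sketch of the  {\bf \mu} update (again my own minimal Python, not a fixed API):

```python
import numpy as np

def update_mu(X, gamma):
    """mu_k = sum_n gamma_nk x_n / sum_n gamma_nk, returned as a (K, D) array."""
    Nk = gamma.sum(axis=0)                  # (K,)
    return (gamma.T @ X) / Nk[:, None]      # responsibility-weighted means, (K, D)
```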

Estimate new  {\bf \Lambda}

We maximize the lower bound with respect to  {\bf \Lambda} by setting   \frac{\partial {\it L}({\bf X}|{\bf \theta})}{\partial {\bf \Lambda}_k} = 0. We then compute this as follows: \begin{eqnarray} \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \Lambda}_k} \left[-\frac{1}{2}({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k({\bf x}_n - {\bf \mu}_k)+\frac{1}{2} \log |{\bf\Lambda}_k| \right]&=& 0\\ \sum_n \gamma_{nk} \frac{\partial}{\partial {\bf \Lambda}_k} \left[-{\rm tr}\left[({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T {\bf\Lambda}_k\right]+\log |{\bf\Lambda}_k| \right]&=& 0 \\ \sum_n \gamma_{nk}\left[({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T\right]^T - \sum_n \gamma_{nk} {\bf \Lambda}_k^{-T} &=& 0 \end{eqnarray} Therefore, the new  {\bf \Lambda}_k is given by \begin{eqnarray} {{\bf \Lambda}^{*}_k}^{-1} &=& \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}({\bf x}_n - {\bf \mu}_k)({\bf x}_n - {\bf \mu}_k)^T \end{eqnarray} After  {\bf \mu}_k has been updated to  {\bf \mu}_k = \frac{\sum_n \gamma_{nk}{\bf x}_n}{\sum_n \gamma_{nk}}, this can be rewritten as
\begin{eqnarray} {{\bf \Lambda}^{*}_k}^{-1} &=& \frac{\sum_n \gamma_{nk} {\bf x}_n {\bf x}_n^T }{\sum_n \gamma_{nk}} - {\bf \mu}_k {\bf \mu}_k^T \end{eqnarray}
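A sketch of the  {\bf \Lambda} update in the same minimal Python style (it computes the responsibility-weighted covariance and then inverts it to get the precision; all names are my own):

```python
import numpy as np

def update_Lambda(X, gamma, mu):
    """Lambda_k^{-1} = sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T / sum_n gamma_nk."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)
    Lambda = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                                        # (N, D)
        Sigma_k = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
        Lambda[k] = np.linalg.inv(Sigma_k)                      # precision = inverse covariance
    return Lambda
```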

Appendix: EM algorithm

Since we are at it, let's look at the general EM algorithm in a bit more detail, approaching it from a different direction. Using  \int q({\bf Z}) d{\bf Z} = 1 and  p({\bf X}|{\bf \theta}) = \frac{p({\bf X}, {\bf Z}|{\bf \theta})}{p({\bf Z}|{\bf X}, {\bf \theta})}, \begin{eqnarray} \log p ({\bf X} | {\bf \theta}) &=& \int q({\bf Z}) \log p ({\bf X} | {\bf \theta}) d{\bf Z}\\ &=& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{p({\bf Z} | {\bf X}, {\bf \theta})}d{\bf Z}\\ &=& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z} | {\bf \theta})}{q({\bf Z})}d{\bf Z} - \int q({\bf Z}) \log \frac{p({\bf Z} | {\bf X}, {\bf \theta})}{q({\bf Z})}d{\bf Z}\\ &=& {\it L}(q, {\bf \theta}) + KL(q||p) \end{eqnarray}

The KL divergence  KL(q||p) satisfies  KL(q||p) \ge 0, with equality  KL(q||p) = 0 when  q({\bf Z}) = p({\bf Z} |{\bf X},  {\bf \theta}). In other words, with  {\bf \theta} fixed, maximizing the lower bound  {\it L}(q, {\bf \theta}) with respect to  q means setting  KL(q||p) = 0 \longleftrightarrow q^{*}({\bf Z}) = p({\bf Z} |{\bf X},  {\bf \theta}), which makes the bound equal to the log likelihood  \log p ({\bf X} | {\bf \theta}). This corresponds to the E step.

Next, given this distribution, so that  \log p ({\bf X} | {\bf \theta}) = {\it L}(q^{*}, {\bf \theta}), we maximize the bound with respect to the parameters  {\bf \theta} that were held fixed before. In most cases, differentiating with respect to  {\bf \theta} does the job. This corresponds to the M step.

By repeating this procedure, the parameters are updated in a direction that never decreases the likelihood. As a caveat, the EM steps are fairly heavy computationally and tend to get stuck in local maxima.
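To make the decomposition above concrete, here is a small numerical check I put together (a 1-D, two-component mixture with a handful of points; all names and values are purely illustrative). For discrete latent assignments the integral over  {\bf Z} becomes a sum over components, and  \log p({\bf X}|{\bf \theta}) should equal  {\it L}(q, {\bf \theta}) + KL(q||p) for any valid  q, with  KL = 0 exactly when  q is the posterior.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D, two-component GMM and a few data points
X = np.array([-2.0, -1.5, 0.3, 1.8, 2.2])
pi = np.array([0.4, 0.6])
mu = np.array([-1.5, 2.0])
sigma = np.array([0.8, 1.0])

# Component densities N(x_n | mu_k, sigma_k^2), shape (N, K)
dens = norm.pdf(X[:, None], loc=mu, scale=sigma)

log_lik = np.sum(np.log(dens @ pi))             # log p(X | theta)
gamma = (pi * dens) / (dens @ pi)[:, None]      # posterior p(z_n | x_n, theta)

# An arbitrary valid q(Z): rows sum to one (deliberately different from the posterior)
q = np.array([[0.7, 0.3]] * len(X))

L_bound = np.sum(q * (np.log(pi) + np.log(dens) - np.log(q)))   # L(q, theta)
KL = np.sum(q * (np.log(q) - np.log(gamma)))                    # KL(q || p(Z|X, theta))

print(log_lik, L_bound + KL)                  # the two numbers agree
print(np.allclose(log_lik, L_bound + KL))     # True
```

Setting q equal to gamma in this snippet drives KL to zero and makes the bound coincide with the log likelihood, which is exactly what the E step does.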

This post covered EM; the next post deals with the Bayesian version of the model: GMMと変分ベイズ - 情報関連の備忘録

References

Machine Learning | The MIT Press: well organized, and easy to read if you already have some background. A PDF can also be found online.