GMM and Variational Bayes

This post continues the discussion from GMM導入 - 情報関連の備忘録 and GMMとEMアルゴリズム - 情報関連の備忘録.

In the model estimated with the EM algorithm, the number of mixed Gaussian components was fixed in advance. In practice, however, the number of components usually cannot be predicted beforehand. So here we take up a model in which the number of clusters is also effectively inferred, together with its optimization. The actual code will be left for another occasion.

Overview

Written out concretely, in the model so far the distributions over the observed variables  {\bf X } and the latent variables  {\bf Z}, with parameters  {\bf \pi, \mu, \Lambda}, were

\begin{eqnarray} p({\bf X | Z, \mu, \Lambda}) &=& \prod_n \prod_k {\it N}({\bf x_n | \mu_k, \Lambda_k^{-1}})^{z_{nk}}\\ p({\bf Z | \pi}) &=& \prod_n \prod_k \pi_k^{z_{nk}} \end{eqnarray}

Now we place prior distributions on the parameters  {\bf \pi, \mu, \Lambda}. To keep the computation tractable we introduce conjugate priors (otherwise variational Bayes would no longer be available and we would have to resort to sampling). We use a Dirichlet distribution for  {\bf \pi} and a Gauss-Wishart distribution for  {\bf \mu, \Lambda}. \begin{eqnarray} p{\bf (\pi)} &=& \frac{\Gamma(\sum_k \alpha_{0k})}{\prod_k \Gamma(\alpha_{0k})} \prod_k \pi_k^{\alpha_{0k} -1}\\ p{\bf (\mu, \Lambda)} &=& \prod_k {\it N}{\bf (\mu_k | m_0, (\beta_0\Lambda_k)^{-1})} {\it W}{\bf (\Lambda_k | W_0, \nu_0)} \end{eqnarray}
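Before moving on, here is a minimal NumPy/SciPy sketch of the generative model these priors imply: draw one set of GMM parameters from the Dirichlet and Gauss-Wishart priors, then sample data. All variable names and hyperparameter values below are mine and purely illustrative.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
K, D = 3, 2  # number of components and data dimension (arbitrary choices)

# Hyperparameters of the conjugate priors (illustrative values)
alpha0 = np.full(K, 1.0)   # Dirichlet concentration alpha_{0k}
m0 = np.zeros(D)           # prior mean of mu_k
beta0 = 1.0                # scaling of the precision of mu_k
W0 = np.eye(D)             # Wishart scale matrix
nu0 = float(D)             # Wishart degrees of freedom (must exceed D - 1)

# Draw one set of GMM parameters from the prior
pi = rng.dirichlet(alpha0)                                      # p(pi): Dirichlet
Lam = np.stack([wishart.rvs(df=nu0, scale=W0, random_state=rng)
                for _ in range(K)])                             # p(Lambda_k): Wishart
mu = np.stack([rng.multivariate_normal(m0, np.linalg.inv(beta0 * Lam[k]))
               for k in range(K)])                              # p(mu_k | Lambda_k): Gaussian

# Generate observations x_n by first sampling z_n ~ Cat(pi)
N = 500
z = rng.choice(K, size=N, p=pi)
X = np.stack([rng.multivariate_normal(mu[k], np.linalg.inv(Lam[k])) for k in z])
```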

We then treat the parameters of these prior distributions with the variational Bayes method and optimize them.

First, a word about variational Bayes.

About Variational Bayes

Let's consider a fully Bayesian model in which all parameters are given prior distributions. The model has latent variables as well as parameters, so let us define  {\bf Z} as the set of all latent variables and parameters. Similarly, we define  {\bf X} as the set of all observed variables. From Jensen's inequality, the log likelihood of  {\bf X} satisfies \begin{eqnarray} \log p ({\bf X}) &=& \log \int p({\bf X}, {\bf Z}) d{\bf Z}\\ &=& \log \int q({\bf Z}) \frac{p({\bf X}, {\bf Z})}{q({\bf Z})}d{\bf Z}\\ &\geq& \int q({\bf Z}) \log \frac{p({\bf X}, {\bf Z})}{q({\bf Z})}d{\bf Z} = L(q) \end{eqnarray} Consider the situation where we want to maximize  L(q) but cannot obtain such a distribution  q directly. To make progress, we partition the elements of  {\bf Z} into disjoint groups that we denote by  {\bf Z_i}, and factorize the distribution over  {\bf Z} as follows \begin{eqnarray} q({\bf Z}) = \prod_i q_i ({\bf Z_i}) \end{eqnarray} Then, keeping all factors except  q_i fixed, we can rewrite the lower bound  L(q) as \begin{eqnarray} L(q) &=& \int q_i({\bf Z_i}) \left(\int q_{-i}({\bf Z_{-i}})\log p({\bf X, Z})d{\bf Z_{-i}}\right)d{\bf Z_i} - \int q_i({\bf Z_i}) \log q_i({\bf Z_i}) d {\bf Z_i} + {\rm const.}\\ &=& \int q_i({\bf Z_i}) <\log p{\bf (X, Z)}>_{q{\bf (Z_{-i})}} d{\bf Z_i} - \int q_i({\bf Z_i}) \log q_i({\bf Z_i}) d {\bf Z_i} + {\rm const.} \end{eqnarray} where  {\bf Z_{-i}} denotes  {\bf Z} with  {\bf Z_{i}} removed. The first two terms form the negative KL divergence between  q_i({\bf Z_i}) and a distribution proportional to  \exp(<\log p{\bf (X, Z)}>_{q{\bf (Z_{-i})}}), so the lower bound is maximized with respect to  q_i when \begin{eqnarray} \log q^{*} ({\bf Z_i}) = <\log p{\bf (X, Z)}>_{q{\bf (Z_{-i})}} + {\rm const.} \end{eqnarray} and we update  \log q^{*} ({\bf Z_i}) for each  i in turn until the lower bound converges.
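As a concrete instance of this coordinate update, here is a minimal runnable sketch on a standard textbook model simpler than the GMM: a univariate Gaussian with unknown mean \mu and precision \tau, priors \mu | \tau \sim N(\mu_0, (\lambda_0 \tau)^{-1}) and \tau \sim {\rm Gamma}(a_0, b_0), and the factorization q(\mu, \tau) = q(\mu) q(\tau). Each of the two closed-form updates below follows from \log q^{*} = <\log p>_{q(\cdot)} + {\rm const.}; all names and values are mine.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=100)   # toy data
N, xbar = len(x), x.mean()

# Prior hyperparameters (illustrative values)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                                # initial guess for <tau>
for _ in range(50):
    # Update q(mu) = N(mu_N, lam_N^{-1}):
    #   log q*(mu) = <log p(x, mu, tau)>_{q(tau)} + const.
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    var_mu = 1.0 / lam_N                       # <mu> = mu_N, Var[mu] = 1/lam_N

    # Update q(tau) = Gamma(a_N, b_N):
    #   log q*(tau) = <log p(x, mu, tau)>_{q(mu)} + const.
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N * var_mu
                      + lam0 * ((mu_N - mu0) ** 2 + var_mu))
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))  # posterior mean of mu, point estimate of the std
```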

VB for GMM

In variational Bayes for the GMM we assume only the factorization \begin{eqnarray} q({\bf Z, \pi, \mu, \Lambda}) = q({\bf Z}) q({\bf \pi, \mu, \Lambda}) \end{eqnarray} Updating the variational distributions under this assumption makes the lower bound on the log likelihood increase monotonically. Interpreted as the analogue of the steps of the EM algorithm, the updates read as follows.

  • VB-E step \begin{eqnarray} \log q^{*} ({\bf Z}) = < \log p{\bf (X | Z, \mu, \Lambda)} + \log p({\bf Z | \pi}) >_{q{\bf (\pi, \mu, \Lambda)}} + {\rm const.} \end{eqnarray}

  • VB-M step \begin{eqnarray} \log q^{*} ({\bf \pi}) &=& <\log p({\bf Z | \pi})p({\bf \pi})>_{q({\bf Z})} + {\rm const.}\\ \log q^{*} ({\bf \mu, \Lambda}) &=& <\log p{\bf (X|Z,\mu,\Lambda)}p{\bf (\mu, \Lambda)}>_{q({\bf Z})} + {\rm const.} \end{eqnarray}

The variational lower bound  L(q) is then obtained as \begin{eqnarray} L &=& <\log p{\bf (X, Z, \pi, \mu, \Lambda)}>_{q{\bf (Z, \pi, \mu, \Lambda)}} - <\log q{\bf (Z, \pi, \mu, \Lambda)}>_{q{\bf (Z, \pi, \mu, \Lambda)}}\\ &=& <\log p{\bf (X | Z, \mu, \Lambda)}>_{q{\bf (Z, \mu, \Lambda)}} + <\log p{\bf (Z | \pi)}>_{q{\bf (Z, \pi)}} + <\log p{\bf (\pi)}>_{q{\bf (\pi)}}\\ &&+ <\log p {\bf (\mu, \Lambda)}>_{q{\bf (\mu, \Lambda)}} - <\log q{\bf (Z)}>_{q{\bf (Z)}} - <\log q{\bf (\pi)}>_{q{\bf (\pi)}}\\ && - <\log q{\bf (\mu, \Lambda)}>_{q{\bf (\mu, \Lambda)}} \end{eqnarray} and can be used to monitor convergence.

Next, let us look at the concrete computations.

VB-E step for GMM

\begin{eqnarray} \log q^{*} ({\bf Z}) &=& \sum_n \sum_k z_{nk} \left( <\log \pi_k>_{q(\pi)} + <\log {\it N} {\bf (x_n|\mu_k, \Lambda_k^{-1})}>_{q{\bf (\mu_k, \Lambda_k)}} \right)\\ &=&\sum_n \sum_k z_{nk} \log \rho_{nk} + {\rm const.} \end{eqnarray} where we define  D as the dimension of  {\bf x_n}, and \begin{eqnarray} \log \rho_{nk} &=& <\log \pi_k>_{q(\pi)} + \frac{1}{2} <\log |{\bf \Lambda_k}|>_{q({\bf \Lambda_k})}\\ && - \frac{D}{2} \log (2\pi) - \frac{1}{2} <{\bf (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)}>_{q({\bf \mu_k, \Lambda_k})} \end{eqnarray} Each term is computed as \begin{eqnarray} <\log \pi_k>_{q(\pi)} &=& \psi(\alpha_k) - \psi (\sum_{k'} \alpha_{k'})\\ <\log |{\bf \Lambda_k}|>_{q({\bf \Lambda_k})} &=& \sum_{i=1}^D \psi(\frac{\nu_k+1-i}{2}) + D \log2 + \log |{\bf W}_k|\\ <{\bf (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)}>_{q({\bf \mu_k, \Lambda_k})} &=& D \beta_k^{-1} + \nu_k ({\bf x_n - m_k})^T {\bf W_k} {\bf (x_n - m_k)} \end{eqnarray} where  \psi is the digamma function and  \alpha_k,  \beta_k,  {\bf m_k},  {\bf W_k},  \nu_k are the hyperparameters of the variational posteriors over  \pi_k and  {\bf \mu_k, \Lambda_k}. From the above discussion, \begin{eqnarray} q^{*}{\bf (Z)} = \prod_n \prod_k \gamma_{nk}^{z_{nk}} \end{eqnarray} where \begin{eqnarray} \gamma_{nk} = \frac{\rho_{nk}}{\sum_j \rho_{nj}} \end{eqnarray}
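The VB-E step above translates almost line by line into code. Below is a sketch (assuming NumPy/SciPy; the function name and argument layout are mine) that evaluates the three expectations and normalizes \rho_{nk} into \gamma_{nk} in log space for numerical stability.

```python
import numpy as np
from scipy.special import digamma, logsumexp

def vb_e_step(X, alpha, beta, m, W, nu):
    """Compute responsibilities gamma_nk from the current hyperparameters.

    X: (N, D) data; alpha: (K,); beta: (K,); m: (K, D); W: (K, D, D); nu: (K,).
    """
    N, D = X.shape
    K = len(alpha)

    # <log pi_k> = digamma(alpha_k) - digamma(sum_k' alpha_k')
    E_log_pi = digamma(alpha) - digamma(alpha.sum())

    log_rho = np.empty((N, K))
    for k in range(K):
        # <log |Lambda_k|> = sum_i digamma((nu_k + 1 - i)/2) + D log 2 + log |W_k|
        E_log_det = (digamma((nu[k] + 1 - np.arange(1, D + 1)) / 2).sum()
                     + D * np.log(2) + np.linalg.slogdet(W[k])[1])
        # <(x_n - mu_k)^T Lambda_k (x_n - mu_k)>
        #   = D / beta_k + nu_k (x_n - m_k)^T W_k (x_n - m_k)
        diff = X - m[k]
        quad = D / beta[k] + nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)
        log_rho[:, k] = (E_log_pi[k] + 0.5 * E_log_det
                         - 0.5 * D * np.log(2 * np.pi) - 0.5 * quad)

    # gamma_nk = rho_nk / sum_j rho_nj, normalized in log space
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```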

VB-M step for GMM

About  q^{*} ({\bf \pi})

We drop the terms that do not depend on  {\bf \pi} from  \log q^{*} ({\bf \pi}). \begin{eqnarray} \log q^{*}({\bf \pi}) &=& \log p({\bf \pi}) + <\log p({\bf Z | \pi})>_{q({\bf Z})} + {\rm const.} \end{eqnarray} Here, we look at each term. \begin{eqnarray} \log p({\bf \pi})&=&\sum_k (\alpha_{0k} - 1)\log \pi_k + {\rm const.}\\ <\log p({\bf Z | \pi})>_{q({\bf Z})} &=& \sum_n \sum_k <z_{nk}>_{q({\bf Z})} \log \pi_k \end{eqnarray} First, we compute the expectation of  z_{nk} under  q( {\bf Z} ): \begin{eqnarray} <z_{nk}>_{q({\bf Z})} &=& \sum_{{\bf Z}} z_{nk} q({\bf Z})\\ &=& \sum_{{\bf Z}} z_{nk} \prod_{n^{\prime}} \prod_{k^{\prime}} \gamma_{n^{\prime}k^{\prime}}^{z_{n^{\prime}k^{\prime}}}\\ &=& \gamma_{nk} \end{eqnarray}

Therefore, \begin{eqnarray} \log q^{*}({\bf \pi})&=&\sum_k (\alpha_{k} - 1) \log \pi_k + {\rm const.} \end{eqnarray} where \begin{eqnarray} \alpha_k = \alpha_{0k} + \sum_n \gamma_{nk} \end{eqnarray} so  q^{*}({\bf \pi}) is again a Dirichlet distribution.

About  q^{*} ({\bf \mu, \Lambda})

We drop the terms that do not depend on  {\bf \mu, \Lambda} from  \log q^{*} ({\bf \mu, \Lambda}). \begin{eqnarray} \log q^{*} ({\bf \mu, \Lambda}) &=& \log p({\bf \mu, \Lambda}) + <\log p({\bf X | Z, \mu, \Lambda}) >_{q {\bf (Z)}} + {\rm const.} \end{eqnarray} Here, we look at each term. \begin{eqnarray} \log p(\bf{ \mu, \Lambda}) &=& \sum_{k}{ \log {\it N } {\bf (\mu_k | m_0, (\beta_0 \Lambda_k)^{-1})} + \log {\it W} {\bf (\Lambda_k | W_0, \nu_0)}}\\ <\log p({\bf X | Z, \mu, \Lambda}) >_{q {\bf (Z)}}&=& \sum_k \sum_n <z_{nk}>_{q {\bf (Z)}} \log {\it N} {\bf (x_n | \mu_k, \Lambda_k^{-1})}\\ &=&\sum_k \sum_n \gamma_{nk} \log {\it N} {\bf (x_n | \mu_k, \Lambda_k^{-1})} \end{eqnarray} Then, we look at component  k. \begin{eqnarray} \log q^{*} ({\bf \mu_k, \Lambda_k}) &=& \log {\it N } {\bf (\mu_k | m_0, (\beta_0 \Lambda_k)^{-1})} + \log {\it W} {\bf (\Lambda_k | W_0, \nu_0)} \\ && + \sum_n \gamma_{nk} \log {\it N} {\bf (x_n | \mu_k, \Lambda_k^{-1})}+ {\rm const.}\\ &=& -\frac{\beta_0}{2} {\bf (\mu_k - m_0)^T \Lambda_k (\mu_k - m_0)} + \frac{1}{2} \log |{\bf \Lambda_k}|\\ &&- \frac{1}{2} {\rm Tr} {\bf (W_0^{-1} \Lambda_k)} + \frac{\nu_0 - D -1}{2} \log |{\bf \Lambda_k} |\\ && - \frac{1}{2} \sum_n \gamma_{nk} {\bf (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)} + \frac{1}{2} \sum_n \gamma_{nk} \log|{\bf \Lambda_k}| + {\rm const.} \end{eqnarray}

Using the product rule of probability, we express  q^{*} ({\bf \mu_k, \Lambda_k}) as  q^{*} ({\bf \mu_k, \Lambda_k}) = q^{*} ({\bf \mu_k| \Lambda_k}) q^{*} ({\bf \Lambda_k}). First, we only keep the terms on the right-hand side that depend on  {\bf \mu_k}. \begin{eqnarray} \log q^{*} ({\bf \mu_k|\Lambda_k})&=& -\frac{\beta_0}{2} {\bf (\mu_k^T \Lambda_k \mu_k - 2 \mu_k^T \Lambda_k m_0)} - \frac{1}{2} \sum_n \gamma_{nk} ({\bf \mu_k^T \Lambda_k \mu_k - 2 \mu_k^T \Lambda_k x_n}) \\ && + {\rm const.}\\ &=& -\frac{1}{2}{\bf \mu_k^T}(\beta_0+ \sum_n \gamma_{nk}) {\bf \Lambda_k \mu_k} + {\bf \mu_k^T \Lambda_k} (\beta_0 {\bf m_0} + \sum_n \gamma_{nk} {\bf x_n}) + {\rm const.} \end{eqnarray} Therefore,  q^{*} ({\bf \mu_k|\Lambda_k}) is a Gaussian distribution, and  \log q^{*} ({\bf \mu_k|\Lambda_k}) is given by \begin{eqnarray} \log q^{*} ({\bf \mu_k|\Lambda_k})&=& -\frac{\beta_k}{2} {\bf (\mu_k - m_k)^T \Lambda_k (\mu_k - m_k)}\\ && + \frac{1}{2} \log| {\bf \Lambda_k } | + {\rm const.} \end{eqnarray} where \begin{eqnarray} \beta_k &=& \beta_0 + \sum_n \gamma_{nk}\\ {\bf m_k} &=& \frac{1}{\beta_k} (\beta_0 {\bf m_0} + \sum_n \gamma_{nk} {\bf x_n}) \end{eqnarray}

Next, we consider  q^{*} ( {\bf \Lambda_k} ). \begin{eqnarray} \log q^{*} ({\bf \Lambda_k}) &=& \log q^{*} ({\bf \mu_k, \Lambda_k}) - \log q^{*} ({\bf \mu_k | \Lambda_k})\\ &=& -\frac{\beta_0}{2} {\bf (\mu_k - m_0)^T \Lambda_k (\mu_k - m_0)} + \frac{1}{2} \log |{\bf \Lambda_k}| - \frac{1}{2} {\rm Tr} {\bf (W_0^{-1} \Lambda_k)} + \frac{\nu_0 - D -1}{2} \log |{\bf \Lambda_k} |\\ && - \frac{1}{2} \sum_n \gamma_{nk} {\bf (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)} + \frac{1}{2} \sum_n \gamma_{nk} \log|{\bf \Lambda_k}| + \frac{\beta_k}{2} {\bf (\mu_k - m_k)^T \Lambda_k (\mu_k - m_k)} - \frac{1}{2} \log| {\bf \Lambda_k } | + {\rm const.}\\ &=& -\frac{\beta_0}{2} {\rm Tr}({\bf (\mu_k - m_0)(\mu_k - m_0)^T \Lambda_k}) - \frac{1}{2} {\rm Tr} {\bf (W_0^{-1} \Lambda_k)} - \frac{1}{2} \sum_n \gamma_{nk} {\rm Tr}({\bf (x_n - \mu_k)(x_n - \mu_k)^T \Lambda_k})\\ && + \frac{\beta_k}{2} {\rm Tr}({\bf (\mu_k - m_k)(\mu_k - m_k)^T \Lambda_k}) + \frac{\nu_0 + \sum_n \gamma_{nk} - D -1}{2} \log |{\bf \Lambda_k} |+ {\rm const.}\\ &=& -\frac{1}{2} {\rm Tr}\Bigl(\bigl(\beta_0{\bf (\mu_k - m_0)(\mu_k - m_0)^T + W_0^{-1}} + \sum_n \gamma_{nk} {\bf (x_n - \mu_k)(x_n - \mu_k)^T - \beta_k (\mu_k - m_k)(\mu_k - m_k)^T}\bigr) {\bf \Lambda_k}\Bigr)\\ && + \frac{\nu_0 + \sum_n \gamma_{nk} - D -1}{2} \log |{\bf \Lambda_k} |+ {\rm const.} \end{eqnarray}

Therefore,  q^{*} ({\bf \Lambda_k}) is a Wishart distribution, whose logarithm is given by \begin{eqnarray} \log q^{*} ({\bf \Lambda_k}) &=& -\frac{1}{2} {\rm Tr}({\bf W_k^{-1} \Lambda_k}) + \frac{\nu_k - D -1}{2} \log |{\bf \Lambda_k} | + {\rm const.} \end{eqnarray} where \begin{eqnarray} {\bf W_k^{-1}} &=& \beta_0{\bf (\mu_k - m_0)(\mu_k - m_0)^T + W_0^{-1}}\\ &&+ \sum_n \gamma_{nk} {\bf (x_n - \mu_k)(x_n - \mu_k)^T - \beta_k (\mu_k - m_k)(\mu_k - m_k)^T}\\ &=& {\bf W_0^{-1}} + \beta_0 {\bf (\mu_k \mu_k^T - \mu_k m_0^T - m_0 \mu_k^T + m_0 m_0^T)}\\ &&+ \sum_n \gamma_{nk} {\bf x_n x_n^T} - \sum_n \gamma_{nk}({\bf x_n \mu_k^T + \mu_k x_n^T}) + \sum_n \gamma_{nk} {\bf \mu_k \mu_k^T}\\ && - (\beta_0 + \sum_n \gamma_{nk}){\bf \mu_k \mu_k^T} + {\bf \mu_k} (\beta_0 {\bf m_0} + \sum_n \gamma_{nk} {\bf x_n})^T \\ &&+(\beta_0 {\bf m_0} + \sum_n \gamma_{nk} {\bf x_n}){\bf \mu_k^T} - \beta_k {\bf m_k m_k^T}\\ &=&{\bf W_0^{-1}} + \beta_0 {\bf m_0 m_0^T} + \sum_n \gamma_{nk} {\bf x_n x_n^T} - \beta_k {\bf m_k m_k^T}\\ \nu_k &=& \nu_0 + \sum_n \gamma_{nk} \end{eqnarray} so the  {\bf \mu_k} dependence cancels, as it must.
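Collecting the updates for \alpha_k, \beta_k, {\bf m_k}, \nu_k and {\bf W_k}, the whole VB-M step can be sketched as follows (again NumPy, with names of my choosing; the {\bf W_k^{-1}} line implements the simplified expression just derived).

```python
import numpy as np

def vb_m_step(X, gamma, alpha0, beta0, m0, W0, nu0):
    """Update the hyperparameters of q(pi) and q(mu, Lambda) from gamma_nk.

    X: (N, D) data; gamma: (N, K) responsibilities; the *0 arguments are
    the prior hyperparameters introduced at the beginning of the post.
    """
    Nk = gamma.sum(axis=0)                        # sum_n gamma_nk, shape (K,)

    alpha = alpha0 + Nk                           # q(pi): Dirichlet
    beta = beta0 + Nk                             # q(mu_k | Lambda_k): Gaussian
    m = (beta0 * m0 + gamma.T @ X) / beta[:, None]

    nu = nu0 + Nk                                 # q(Lambda_k): Wishart
    K, D = m.shape
    W = np.empty((K, D, D))
    W0_inv = np.linalg.inv(W0)
    for k in range(K):
        # W_k^{-1} = W_0^{-1} + beta_0 m_0 m_0^T
        #            + sum_n gamma_nk x_n x_n^T - beta_k m_k m_k^T
        Wk_inv = (W0_inv + beta0 * np.outer(m0, m0)
                  + (gamma[:, k, None] * X).T @ X
                  - beta[k] * np.outer(m[k], m[k]))
        W[k] = np.linalg.inv(Wk_inv)
    return alpha, beta, m, nu, W
```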

Collecting the above results, we express  q^{*} ({\bf \mu, \Lambda}) as follows. \begin{eqnarray} q^{*} ({\bf \mu, \Lambda})&=& \prod_k q^{*} ({\bf \mu_k, \Lambda_k})\\ &=& \prod_k{\it N}{\bf(\mu_k|m_k, (\beta_k \Lambda_k)^{-1})} {\it W}{\bf (\Lambda_k | W_k, \nu_k)} \end{eqnarray}
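Finally, here is a sketch of the overall loop that alternates the two steps, reusing the toy data X from the first sketch and the vb_e_step / vb_m_step sketches above. The initialization choices are mine. Note that with a small \alpha_{0k} the expected mixing weights of superfluous components are driven toward zero, which is how this model effectively infers the number of clusters mentioned at the beginning.

```python
import numpy as np

# Fit with a deliberately generous number of components
K = 6
N, D = X.shape
alpha0_vec = np.full(K, 1e-3)      # small alpha0 lets unused components die out
m0_, beta0_, W0_, nu0_ = X.mean(axis=0), 1.0, np.eye(D), float(D)

# Initialize the variational hyperparameters (random data points as means)
alpha, beta = alpha0_vec.copy(), np.full(K, beta0_)
m = X[np.random.default_rng(2).choice(N, K, replace=False)]
W, nu = np.stack([W0_] * K), np.full(K, nu0_)

# A full implementation would monitor the lower bound L(q) for convergence;
# here we simply run a fixed number of iterations.
for _ in range(200):
    gamma = vb_e_step(X, alpha, beta, m, W, nu)
    alpha, beta, m, nu, W = vb_m_step(X, gamma, alpha0_vec, beta0_, m0_, W0_, nu0_)

# Expected mixing weights <pi_k>; near-zero entries are effectively pruned
print(alpha / alpha.sum())
```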