Introduction to GMM
A Gaussian mixture model (GMM) is, as the name suggests, a model composed of multiple Gaussian distributions. As a concrete example, consider two-dimensional data like the following.
If we model these data as being generated from three Gaussian distributions, we obtain a GMM.
The goal of this model is to estimate, assuming the data were generated by a GMM, which cluster (latent variable) each point came from. The estimation uses the EM algorithm. This article introduces the GMM itself. For EM, see GMMとEMアルゴリズム - 情報関連の備忘録; for variational Bayes, see GMMと変分ベイズ - 情報関連の備忘録.
About the Gaussian mixture model
Let's consider the likelihood of the Gaussian mixture model, and then cover the optimization of its parameters. Here, ${\bf x}$ is an observation variable.
Assumption
In GMM, we assume the following distributions.
We introduce a discrete latent variable ${\bf z} = (z_1, \ldots, z_K)^{\rm T}$ (which corresponds to a cluster number). This variable satisfies $z_k \in \{0, 1\}$ and $\sum_{k} z_k = 1$. The probability of a latent variable is given by \begin{eqnarray} p({\bf z}) &=& \prod_{k} p(z_k)\nonumber\\ &=&\prod_{k} \pi_k^{z_{k}} \end{eqnarray} where the mixing coefficients $\pi_k$ satisfy $0 \leq \pi_k \leq 1$ and $\sum_{k} \pi_k = 1$.
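The one-hot latent variable above can be sampled directly from the mixing coefficients. A minimal sketch (the values of $\pi$ are illustrative, not from the article):

```python
import numpy as np

# Illustrative mixing coefficients: 0 <= pi_k <= 1 and sum_k pi_k = 1.
pi = np.array([0.3, 0.5, 0.2])

rng = np.random.default_rng(0)
# Draw z from p(z) = prod_k pi_k^{z_k}: a one-hot vector with exactly one z_k = 1.
z = rng.multinomial(1, pi)
k = int(np.argmax(z))  # index of the selected cluster
print(z, k)
```

Because exactly one component of `z` equals 1, the product $\prod_k \pi_k^{z_k}$ picks out the single coefficient $\pi_k$ of the chosen cluster.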
Then we assume that the distribution of ${\bf x}$ in cluster $k$ is a Gaussian such as \begin{eqnarray} p({\bf x}| z_k = 1) = {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) \end{eqnarray} where ${\bf \mu}_k$ is the mean vector for cluster $k$ and ${\bf \Lambda}_k$ is the precision matrix for cluster $k$.
Therefore, we introduce ${\bf \theta}$, the set of all parameters, that is ${\bf \theta} = \{{\bf \pi}, {\bf \mu}, {\bf \Lambda}\}$. The likelihood of the Gaussian mixture model is obtained by marginalizing out ${\bf z}$: \begin{eqnarray} p ({\bf x} ; {\bf \theta}) &=& \sum_{\bf z} p({\bf z}) p({\bf x}|{\bf z})\\ &=&\sum_{k} \pi_{k} {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) \end{eqnarray} And the joint probability of ${\bf x}$ and ${\bf z}$ given ${\bf \theta}$ is \begin{eqnarray} p({\bf x}, {\bf z} | {\bf \theta}) &=& p({\bf z}) p({\bf x}|{\bf z})\\ &=& \prod_{k} \pi_k^{z_{k}} {\it N}({\bf x}|{\bf \mu}_k, {\bf \Lambda}_k^{-1} )^{z_{k}} \end{eqnarray}
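The marginal likelihood above can be evaluated numerically. A minimal sketch, using illustrative (not fitted) parameters and a hand-written Gaussian density in place of a library call; note the code works with covariances ${\bf \Sigma}_k = {\bf \Lambda}_k^{-1}$:

```python
import numpy as np

# Illustrative parameters theta = {pi, mu, Lambda}; covariances are Lambda^{-1}.
pi = np.array([0.3, 0.5, 0.2])
mus = [np.array([-2.0, -2.0]), np.array([0.0, 5.0]), np.array([5.0, 0.0])]
covs = [3.0 * np.eye(2),
        np.array([[3.0, 2.0], [2.0, 3.0]]),
        np.array([[2.0, -1.0], [-1.0, 2.0]])]

def gauss_pdf(x, mu, cov):
    """Density N(x | mu, cov) of a d-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    prec = np.linalg.inv(cov)  # precision matrix Lambda
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ prec @ diff) / norm)

def gmm_likelihood(x):
    """p(x; theta) = sum_k pi_k N(x | mu_k, Lambda_k^{-1})."""
    return sum(p * gauss_pdf(x, m, c) for p, m, c in zip(pi, mus, covs))

print(gmm_likelihood(np.array([0.0, 0.0])))
```

As a sanity check, the density should be much larger near a component mean than far from all of them.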
Introduce responsibility
We introduce a convenient variable, the responsibility
$\gamma_k \equiv p(z_k = 1 | {\bf x}; {\bf \theta})$.
The probability of $z_k = 1$ given ${\bf x}$ and ${\bf \theta}$ is given by
\begin{eqnarray}
p(z_k = 1 | {\bf x}; {\bf \theta}) &\propto& p(z_k = 1) p({\bf x}|z_k = 1; {\bf \theta})\\
p(z_k = 1 | {\bf x}; {\bf \theta})& = &\frac{\pi_k {\it N}({\bf x}|{\bf \mu}_k, {\bf \Lambda}_k^{-1} )}{\sum_{j} \pi_j {\it N}({\bf x}|{\bf \mu}_j, {\bf \Lambda}_j^{-1})}\\
&=& \gamma_{k}
\end{eqnarray}
where $\gamma_k$ is the responsibility
that cluster $k$ takes for 'explaining' the observation ${\bf x}$.
Therefore, the probability of ${\bf z}$ is
\begin{eqnarray}
p({\bf z} | {\bf x}; {\bf \theta}) = \prod_k \gamma_k^{z_k}
\end{eqnarray}
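The responsibilities can be computed by normalizing the weighted component densities. A minimal sketch with the same illustrative parameters as above (`gauss_pdf` is a plain multivariate normal density, not part of the article):

```python
import numpy as np

# Illustrative parameters; covariances are Lambda_k^{-1}.
pi = np.array([0.3, 0.5, 0.2])
mus = [np.array([-2.0, -2.0]), np.array([0.0, 5.0]), np.array([5.0, 0.0])]
covs = [3.0 * np.eye(2),
        np.array([[3.0, 2.0], [2.0, 3.0]]),
        np.array([[2.0, -1.0], [-1.0, 2.0]])]

def gauss_pdf(x, mu, cov):
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def responsibilities(x):
    """gamma_k = pi_k N(x|mu_k, Sigma_k) / sum_j pi_j N(x|mu_j, Sigma_j)."""
    weights = np.array([p * gauss_pdf(x, m, c)
                        for p, m, c in zip(pi, mus, covs)])
    return weights / weights.sum()  # normalize so the gammas sum to 1

gamma = responsibilities(np.array([0.0, 0.0]))
print(gamma)
```

Since the $\gamma_k$ are nonnegative and sum to 1, they form a proper posterior distribution over the clusters for each observation.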
Supplement
The code used to generate the data is shown below.
import numpy as np
import matplotlib.pyplot as plt

Xs = []
# -- first cluster --
mu = [-2, -2]
sigma = [[3, 0], [0, 3]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)
# -- second cluster --
mu = [0, 5]
sigma = [[3, 2], [2, 3]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)
# -- third cluster --
mu = [5, 0]
sigma = [[2, -1], [-1, 2]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)

colors = ["green", "blue", "r"]
fig = plt.figure()
ax = fig.add_subplot(111)
for X, color in zip(Xs, colors):
    ax.scatter(X[:, 0], X[:, 1], alpha=0.4, color=color)
ax.set_xlabel("first dimension")
ax.set_ylabel("second dimension")
plt.show()