Introduction to GMM

A Gaussian mixture model (GMM) is, as the name suggests, a model composed of multiple Gaussian distributions. Concretely, consider two-dimensional data like the following.

f:id:endosan:20190702203216p:plain

If we model this data as being generated from three Gaussian distributions, we get a GMM.

f:id:endosan:20190702203122p:plain

The goal of this model is to assume the data was generated by a GMM and to estimate which cluster (latent variable) each observation came from. The estimation uses the EM algorithm. This article covers the introduction of the GMM itself. For EM, see GMMとEMアルゴリズム - 情報関連の備忘録, and for variational Bayes, see GMMと変分ベイズ - 情報関連の備忘録.

About the Gaussian mixture model

Let's consider the likelihood of the Gaussian mixture model  p ({\bf x} ; {\bf \theta} ), and then cover the optimization of its parameters. Here,  {\bf x} is an observed variable.

Assumptions

In a GMM, we assume the following distributions.

We introduce a discrete latent variable  {\bf z} (which corresponds to the cluster index). This variable uses the 1-of-K representation: it satisfies  z_k \in \{ 0, 1\} and  \sum_k z_k = 1. The probability of the latent variable  {\bf z} is given by \begin{eqnarray} p({\bf z}) = \prod_{k} \pi_k^{z_{k}} \end{eqnarray} where the mixing coefficients  \pi_k satisfy  0 \leq \pi_{k} \leq 1 and  \sum_{k} \pi_{k} = 1.
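Concretely, sampling  {\bf z} amounts to drawing one cluster index from a categorical distribution. Here is a minimal sketch with numpy; the mixing coefficients below are assumed toy values, not taken from the article.

import numpy as np

pi = np.array([0.3, 0.5, 0.2])        # assumed mixing coefficients, sum to 1
k = np.random.choice(len(pi), p=pi)   # draw a cluster index with probability pi_k
z = np.eye(len(pi), dtype=int)[k]     # 1-of-K vector, e.g. [0, 1, 0]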

Then we assume that the distribution of  {\bf x} within cluster k is a Gaussian: \begin{eqnarray} p({\bf x}| z_k = 1) = {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) \end{eqnarray} where  {\bf \mu}_k is the mean vector for cluster k and  {\bf \Lambda}_k is the precision matrix for cluster k, so  {\bf \Lambda}_k^{-1} is the covariance matrix.
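Since  {\bf \Lambda}_k^{-1} is the covariance, sampling from this conditional with numpy looks like the following sketch; the mean and precision below are assumed toy values.

import numpy as np

mu_k = np.array([0.0, 5.0])             # assumed mean vector for cluster k
Lambda_k = np.array([[2.0, 0.0],
                     [0.0, 2.0]])       # assumed precision matrix for cluster k
cov_k = np.linalg.inv(Lambda_k)         # covariance is the inverse precision
x = np.random.multivariate_normal(mu_k, cov_k)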

We collect all the parameters into  \theta = ({\bf \pi}, {\bf \mu}, {\bf \Lambda}). The likelihood of the Gaussian mixture model is obtained by marginalizing out  {\bf z}: \begin{eqnarray} p ({\bf x} ; {\bf \theta}) &=& \sum_{\bf z} p({\bf z}) p({\bf x}|{\bf z})\\ &=&\sum_{k} \pi_{k} {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) \end{eqnarray} The joint probability of  {\bf x} and  {\bf z} given  {\bf \theta} is \begin{eqnarray} p({\bf x}, {\bf z} | {\bf \theta}) &=& p({\bf z} | {\bf \pi}) p({\bf x}|{\bf z} ; {\bf \mu}, {\bf \Lambda})\\ &=& \prod_{k} \left[ \pi_k {\it N}({\bf x}|{\bf \mu}_k, {\bf \Lambda}_k^{-1} ) \right]^{z_{k}} \end{eqnarray}
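As a sanity check, here is a minimal sketch of evaluating the mixture density  \sum_{k} \pi_{k} {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) at a single point using scipy. The mixing coefficients and the identity covariances are assumed toy values; only the three means mirror the data-generating code in the appendix.

import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.3, 0.4, 0.3])                        # assumed mixing coefficients
mus = [np.array([-2, -2]), np.array([0, 5]), np.array([5, 0])]
covs = [np.eye(2)] * 3                                # assumed covariances Lambda_k^{-1}

x = np.array([0.0, 0.0])
density = sum(p * multivariate_normal.pdf(x, mean=m, cov=c)
              for p, m, c in zip(pi, mus, covs))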

Introducing the responsibility

We now introduce a convenient quantity called the responsibility.

The posterior probability of  {\bf z} given  {\bf x} and  {\bf \theta} is given by \begin{eqnarray} p(z_k = 1 | {\bf x}; {\bf \theta}) &\propto& p(z_k = 1) p({\bf x}|z_k = 1; {\bf \theta})\\ \gamma_{k} \equiv p(z_k = 1 | {\bf x}; {\bf \theta})& = &\frac{\pi_k {\it N}({\bf x}|{\bf \mu}_k, {\bf \Lambda}_k^{-1} )}{\sum_{j} \pi_j {\it N}({\bf x}|{\bf \mu}_j, {\bf \Lambda}_j^{-1})} \end{eqnarray} where  \gamma_k is the responsibility that cluster k takes for 'explaining' the observation  {\bf x}. Therefore, the posterior of  {\bf z} is \begin{eqnarray} p({\bf z} | {\bf x}; {\bf \theta}) = \prod_k \gamma_k^{z_k} \end{eqnarray}
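The responsibilities are just the weighted component densities, normalized to sum to 1. A minimal sketch, reusing the assumed parameters pi, mus, covs from the previous snippet:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, covs):
    # unnormalized terms: pi_k * N(x | mu_k, Lambda_k^{-1})
    weighted = np.array([p * multivariate_normal.pdf(x, mean=m, cov=c)
                         for p, m, c in zip(pi, mus, covs)])
    return weighted / weighted.sum()   # gamma_k; the entries sum to 1

gamma = responsibilities(np.array([0.0, 0.0]), pi, mus, covs)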

Appendix

The code that generated the data above is as follows.

import numpy as np
import matplotlib.pyplot as plt

Xs = []
#-- first cluster --
mu = [-2, -2]
sigma = [[3, 0], [0, 3]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)
#-- second cluster --
mu = [0, 5]
sigma = [[3, 2], [2, 3]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)
#-- third cluster --
mu = [5, 0]
sigma = [[2, -1], [-1, 2]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)

colors = ["green", "blue", "red"]
fig = plt.figure()
ax = fig.add_subplot(111)
for X, color in zip(Xs,colors):
    ax.scatter(X[:, 0], X[:, 1], alpha=0.4, color=color)
ax.set_xlabel("first dimension")
ax.set_ylabel("second dimension")
plt.show()