Introduction to GMM

A Gaussian mixture model (GMM) is, as the name suggests, a model composed of multiple Gaussian distributions. Concretely, consider two-dimensional data like the following.

f:id:endosan:20190702203216p:plain

If we model this data as being generated from three Gaussian distributions, we get a GMM.

f:id:endosan:20190702203122p:plain

The goal of this model is to assume the data was generated by a GMM and to estimate which cluster (latent variable) each observation came from. The estimation uses the EM algorithm. This article covers the introduction of the GMM itself. For EM, see GMMとEMアルゴリズム - 情報関連の備忘録, and for variational Bayes, see GMMと変分ベイズ - 情報関連の備忘録.

About the Gaussian mixture model

Let's consider the likelihood of the Gaussian mixture model  p ({\bf x} ; {\bf \theta} ), and then cover the optimization of its parameters. Here,  {\bf x} is an observed variable.

Assumptions

In a GMM, we assume the following distributions.

We introduce a discrete latent variable  {\bf z} (which corresponds to the cluster index). This variable uses the 1-of-K representation: it satisfies  z_k \in \{ 0, 1\} and  \sum_k z_k = 1. The probability of the latent variable  {\bf z} is given by \begin{eqnarray} p({\bf z}) = \prod_{k} \pi_k^{z_{k}} \end{eqnarray} where the mixing coefficients  \pi_k satisfy  0 \leq \pi_{k} \leq 1 and  \sum_{k} \pi_{k} = 1.
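Concretely, sampling  {\bf z} amounts to drawing one cluster index from a categorical distribution. Here is a minimal sketch with numpy; the mixing coefficients below are assumed toy values, not taken from the article.

import numpy as np

pi = np.array([0.3, 0.5, 0.2])        # assumed mixing coefficients, sum to 1
k = np.random.choice(len(pi), p=pi)   # draw a cluster index with probability pi_k
z = np.eye(len(pi), dtype=int)[k]     # 1-of-K vector, e.g. [0, 1, 0]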

Then we assume that the distribution of  {\bf x} within cluster k is a Gaussian: \begin{eqnarray} p({\bf x}| z_k = 1) = {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) \end{eqnarray} where  {\bf \mu}_k is the mean vector for cluster k and  {\bf \Lambda}_k is the precision matrix for cluster k, so  {\bf \Lambda}_k^{-1} is the covariance matrix.
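Since  {\bf \Lambda}_k^{-1} is the covariance, sampling from this conditional with numpy looks like the following sketch; the mean and precision below are assumed toy values.

import numpy as np

mu_k = np.array([0.0, 5.0])             # assumed mean vector for cluster k
Lambda_k = np.array([[2.0, 0.0],
                     [0.0, 2.0]])       # assumed precision matrix for cluster k
cov_k = np.linalg.inv(Lambda_k)         # covariance is the inverse precision
x = np.random.multivariate_normal(mu_k, cov_k)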

We collect all the parameters into  \theta = ({\bf \pi}, {\bf \mu}, {\bf \Lambda}). The likelihood of the Gaussian mixture model is obtained by marginalizing out  {\bf z}: \begin{eqnarray} p ({\bf x} ; {\bf \theta}) &=& \sum_{\bf z} p({\bf z}) p({\bf x}|{\bf z})\\ &=&\sum_{k} \pi_{k} {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) \end{eqnarray} The joint probability of  {\bf x} and  {\bf z} given  {\bf \theta} is \begin{eqnarray} p({\bf x}, {\bf z} | {\bf \theta}) &=& p({\bf z} | {\bf \pi}) p({\bf x}|{\bf z} ; {\bf \mu}, {\bf \Lambda})\\ &=& \prod_{k} \left[ \pi_k {\it N}({\bf x}|{\bf \mu}_k, {\bf \Lambda}_k^{-1} ) \right]^{z_{k}} \end{eqnarray}
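As a sanity check, here is a minimal sketch of evaluating the mixture density  \sum_{k} \pi_{k} {\it N}({\bf x} | {\bf \mu}_{k}, {\bf \Lambda}_{k}^{-1}) at a single point using scipy. The mixing coefficients and the identity covariances are assumed toy values; only the three means mirror the data-generating code in the appendix.

import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.3, 0.4, 0.3])                        # assumed mixing coefficients
mus = [np.array([-2, -2]), np.array([0, 5]), np.array([5, 0])]
covs = [np.eye(2)] * 3                                # assumed covariances Lambda_k^{-1}

x = np.array([0.0, 0.0])
density = sum(p * multivariate_normal.pdf(x, mean=m, cov=c)
              for p, m, c in zip(pi, mus, covs))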

Introducing the responsibility

We now introduce a convenient quantity called the responsibility.

The posterior probability of  {\bf z} given  {\bf x} and  {\bf \theta} is given by \begin{eqnarray} p(z_k = 1 | {\bf x}; {\bf \theta}) &\propto& p(z_k = 1) p({\bf x}|z_k = 1; {\bf \theta})\\ \gamma_{k} \equiv p(z_k = 1 | {\bf x}; {\bf \theta})& = &\frac{\pi_k {\it N}({\bf x}|{\bf \mu}_k, {\bf \Lambda}_k^{-1} )}{\sum_{j} \pi_j {\it N}({\bf x}|{\bf \mu}_j, {\bf \Lambda}_j^{-1})} \end{eqnarray} where  \gamma_k is the responsibility that cluster k takes for 'explaining' the observation  {\bf x}. Therefore, the posterior of  {\bf z} is \begin{eqnarray} p({\bf z} | {\bf x}; {\bf \theta}) = \prod_k \gamma_k^{z_k} \end{eqnarray}
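The responsibilities are just the weighted component densities, normalized to sum to 1. A minimal sketch, reusing the assumed parameters pi, mus, covs from the previous snippet:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, covs):
    # unnormalized terms: pi_k * N(x | mu_k, Lambda_k^{-1})
    weighted = np.array([p * multivariate_normal.pdf(x, mean=m, cov=c)
                         for p, m, c in zip(pi, mus, covs)])
    return weighted / weighted.sum()   # gamma_k; the entries sum to 1

gamma = responsibilities(np.array([0.0, 0.0]), pi, mus, covs)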

Appendix

The code that generated the data above is as follows.

import numpy as np
import matplotlib.pyplot as plt

Xs = []
#-- first cluster --
mu = [-2, -2]
sigma = [[3, 0], [0, 3]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)
#-- second cluster --
mu = [0, 5]
sigma = [[3, 2], [2, 3]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)
#-- third cluster --
mu = [5, 0]
sigma = [[2, -1], [-1, 2]]
tmp = np.random.multivariate_normal(mu, sigma, 1000)
Xs.append(tmp)

colors = ["green", "blue", "red"]
fig = plt.figure()
ax = fig.add_subplot(111)
for X, color in zip(Xs,colors):
    ax.scatter(X[:, 0], X[:, 1], alpha=0.4, color=color)
ax.set_xlabel("first dimension")
ax.set_ylabel("second dimension")
plt.show()