Study Apache Spark MLlib on IPython—Clustering—GMM

Hi Guys,

Today, let’s talk about the Gaussian Mixture Model (GMM) and how to use it in scikit-learn and Spark.

I. Background Knowledge

Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model: given a statistical model and a data set, MLE provides the parameter values under which the observed data are most probable.
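As a quick example, for data assumed i.i.d. from a single Gaussian the MLE has a closed form: the sample mean and the (biased, divide-by-N) sample variance. A minimal NumPy sketch with made-up toy data:

```python
import numpy as np

# Toy data assumed drawn i.i.d. from one Gaussian (illustrative only)
x = np.random.normal(loc=2.0, scale=1.5, size=1000)

# Closed-form MLE for a single Gaussian:
# mu_hat = sample mean, sigma2_hat = mean squared deviation (1/N, biased)
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

print(mu_hat, sigma2_hat)  # should be close to 2.0 and 1.5**2 = 2.25
```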

Expectation–maximization (EM) is an iterative method for finding maximum-likelihood estimates when the model depends on unobserved latent variables. Each EM iteration alternates between an expectation (E) step, which constructs the expected log-likelihood evaluated under the current parameter estimates, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step. Those parameter estimates are then used to determine the distribution of the latent variables in the next E step.

GMM

1. Objective function: maximize the log-likelihood, using EM as the framework.
2. EM algorithm (a minimal NumPy sketch of a full run follows below):
          E-step: compute the posterior probability of membership for each point (the soft assignment).
          M-step: optimize the parameters given those posteriors.
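To make the two steps concrete, here is a minimal NumPy sketch of EM for a two-component 1-D GMM; the toy data, initial guesses, and iteration count are all illustrative assumptions, not a library API:

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D data from two Gaussians (illustrative only)
x = np.concatenate([np.random.normal(-2, 1.0, 300),
                    np.random.normal(3, 1.5, 700)])

# Initial guesses for the K=2 weights, means, and variances
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma2 = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior probability (responsibility) of each component
    # for each point -- the soft assignment
    dens = np.stack([wk * norm.pdf(x, m, np.sqrt(s2))
                     for wk, m, s2 in zip(w, mu, sigma2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)   # shape (N, 2)

    # M-step: re-estimate parameters from the responsibility-weighted data
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma2 = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, sigma2)  # should approach the true weights, means, variances
```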
Here’s a good YouTube tutorial explaining the difference between K-means, GMM, and EM.

Some definitions and terms from Wikipedia

1. N random variables corresponding to observations, each assumed to be distributed according to a mixture of K components, with each component belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters
2. N corresponding random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution
3. Normal (Gaussian) distribution. The notation is $\mathcal{N}(\mu, \sigma^2)$, where $\mu \in \mathbb{R}$ is the mean (location) and $\sigma^2 > 0$ is the variance (squared scale). Therefore, if the GMM model returns $\mu$ and $\sigma^2$ for each component, we can plot the mixed distribution functions.
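Concretely, the mixture density that gets plotted is the weighted sum of the K component Gaussians, with the weights $\pi_k$ coming from the categorical latent variable in item 2:

$$
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.
$$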

II. Mathematical Proofs

Here are some very detailed and relatively easy-to-understand videos illustrating the GMM and EM algorithms:

GMM

EM Part 1

EM Part 2

EM Part 3

EM Part 4

III. IPython Code (GitHub)

The following contour plot indicates the envelope of the two GMM component distributions.
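Since the scikit-learn snippets live in the linked GitHub notebook, here is a minimal sketch of that step; the toy data, grid ranges, and n_components=2 are assumptions, and recent scikit-learn exposes this as sklearn.mixture.GaussianMixture:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Toy 2-D data from two Gaussian blobs (stand-in for the notebook's data)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=[-2, -2], scale=0.8, size=(300, 2)),
               rng.normal(loc=[2, 2], scale=1.2, size=(300, 2))])

# Fit a 2-component GMM (EM under the hood)
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.weights_)       # mixing weights pi_k
print(gmm.means_)         # component means mu_k
print(gmm.covariances_)   # component covariances

# Contour plot of the mixture's negative log-likelihood over a grid
xs, ys = np.meshgrid(np.linspace(-6, 6, 200), np.linspace(-6, 6, 200))
grid = np.column_stack([xs.ravel(), ys.ravel()])
Z = -gmm.score_samples(grid).reshape(xs.shape)

plt.contour(xs, ys, Z, levels=20)
plt.scatter(X[:, 0], X[:, 1], s=4)
plt.show()
```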
The following code is based on PySpark.
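A minimal sketch of this step with the MLlib RDD API, assuming the sample file gmm_data.txt from the Spark repo (reference 5); paths and k=2 are illustrative:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture

sc = SparkContext(appName="GMMExample")

# Each line of gmm_data.txt is a space-separated point
data = sc.textFile("data/mllib/gmm_data.txt")
parsed = data.map(lambda line: [float(v) for v in line.strip().split()])

# Train a 2-component Gaussian mixture model with EM
model = GaussianMixture.train(parsed, k=2)

# Inspect the fitted weights, means, and covariances
for i, g in enumerate(model.gaussians):
    print("weight =", model.weights[i],
          "mu =", g.mu, "sigma =", g.sigma.toArray())
```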
We can see that both the scikit-learn and the PySpark models indicate a similar mixture of GMM components.
References
1. http://blog.csdn.net/zouxy09/article/details/8537620 (in Chinese)
2. http://stackoverflow.com/questions/23546349/loading-text-file-containing-both-float-and-string-using-numpy-loadtxt
3. https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
4. http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_pdf.html#example-mixture-plot-gmm-pdf-py
5. https://github.com/apache/spark/blob/master/data/mllib/gmm_data.txt
6. https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model
