Today, let’s talk about the Gaussian Mixture Model (GMM) and how to use it in scikit-learn and Spark.
I. Background Knowledge
Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. Given a data set and a statistical model, it chooses the parameter values under which the observed data are most probable.
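As a minimal sketch of MLE for a single Gaussian, assuming synthetic data generated with NumPy (the sample, seed, and variable names are illustrative and not from the original notebook): for a normal distribution the maximum-likelihood estimates have a closed form, namely the sample mean and the sample variance that divides by N.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: 10,000 draws from N(mu=2.0, sigma^2=1.5^2 = 2.25).
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

# For a Gaussian, maximizing the log-likelihood has a closed-form solution.
mu_hat = x.mean()                      # MLE of the mean
var_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance (divides by N, not N - 1)

def log_likelihood(data, mu, var):
    """Total Gaussian log-likelihood of `data` under N(mu, var)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (data - mu) ** 2 / (2 * var))

print("estimated mean:", mu_hat)
print("estimated variance:", var_hat)
print("log-likelihood at the MLE:", log_likelihood(x, mu_hat, var_hat))
```

No other pair of parameter values gives this sample a higher log-likelihood than the estimates, including the true values used to generate it.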
The expectation–maximization (EM) algorithm is an iterative method for finding maximum-likelihood estimates of parameters when the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes the parameters that maximize the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
1. Objective function: maximize the log-likelihood, using EM as the framework.
2. EM algorithm:
E-step: Compute posterior probability of membership.
M-step: Optimize parameters.
The E-step performs a soft assignment: each point gets a probability of belonging to each component rather than a single hard label.
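The two steps above can be sketched for a one-dimensional GMM in plain NumPy. This is an illustrative sketch under stated assumptions, not the code from the original notebook: the function names `gaussian_pdf` and `em_gmm_1d`, the quantile-based initialization, the fixed iteration count, and the synthetic data are all choices made here for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x (broadcasts over arrays)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=100):
    """Fit a k-component 1-D GMM with EM; returns weights, means, variances, log-likelihoods."""
    n = x.shape[0]
    weights = np.full(k, 1.0 / k)
    # Assumed initialization: spread the initial means over the data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior probability of membership (soft assignment), shape (n, k).
        dens = weights * gaussian_pdf(x[:, None], means, variances)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        log_liks.append(np.log(total).sum())
        # M-step: re-estimate the parameters from the weighted points.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, np.array(log_liks)

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated clusters.
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)])
weights, means, variances, log_liks = em_gmm_1d(x, k=2)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

The log-likelihood recorded at each iteration should never decrease, which is a handy sanity check when writing EM by hand.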
Here’s a good YouTube tutorial that explains the differences between K-means, GMM, and EM.
Some definitions and terms from wiki
Random variables corresponding to observations, each assumed to be distributed according to a mixture of K components, with each component belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters.
3. Normal distribution (Gaussian distribution). The notation is N(μ, σ²), where μ ∈ R is the mean (location) and σ² > 0 is the variance (squared scale). Therefore, if the GMM returns μ and σ² for each component, along with the component weights, we can plot the mixed distribution functions.
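To make the last point concrete, here is a small sketch that evaluates a mixture density on a grid from per-component parameters. The weights, means, and variances below are hypothetical values chosen for illustration, not results from the original notebook, and `mixture_pdf` is a helper name introduced here.

```python
import numpy as np

# Hypothetical fitted parameters for a two-component 1-D mixture.
weights = np.array([0.4, 0.6])
means = np.array([-1.0, 2.0])
variances = np.array([0.5, 1.5])

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of the component normal densities, evaluated at each x."""
    x = np.asarray(x, dtype=float)[:, None]   # shape (n, 1), broadcasts against (k,)
    comp = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return comp @ weights                      # shape (n,)

grid = np.linspace(-8.0, 12.0, 2001)
density = mixture_pdf(grid, weights, means, variances)

# Sanity check: a density should integrate to about 1 over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print("approximate area under the curve:", area)
```

Passing `grid` and `density` to a plotting call such as matplotlib's `plt.plot(grid, density)` would draw the mixed distribution.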
II. Mathematical Proof
Here are some very detailed and relatively easy-to-understand videos that illustrate the GMM and EM algorithms.
III. IPython Code (GitHub)
The contour plot shows the envelopes of the two GMM component distributions.
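A minimal sketch of how such a fit and the values behind a contour plot could be produced with scikit-learn, assuming synthetic two-cluster data (not the data from the original notebook). `GaussianMixture`, `predict_proba`, and `score_samples` are scikit-learn's own names; the data, grid range, and variable names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D data: two Gaussian blobs.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.8, size=(300, 2)),
])

# Fit a two-component GMM with full covariance matrices.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)
print("weights:", gmm.weights_)
print("means:", gmm.means_)

# Soft assignment: per-point membership probabilities, one column per component.
resp = gmm.predict_proba(X)

# Negative log-density on a grid; these are the values a contour plot would draw.
xs, ys = np.meshgrid(np.linspace(-2.0, 7.0, 100), np.linspace(-2.0, 7.0, 100))
grid = np.column_stack([xs.ravel(), ys.ravel()])
neg_log_density = -gmm.score_samples(grid).reshape(xs.shape)
```

With matplotlib, `plt.contour(xs, ys, neg_log_density)` together with a scatter of `X` would show one set of nested contours around each component.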
The following code is based on PySpark.
We can see that both the scikit-learn and PySpark models produce a similar mixture of GMM components.
1. http://blog.csdn.net/zouxy09/article/details/8537620 (in Chinese)