Hi Guys,

Today, let’s talk about the GMM (Gaussian Mixture Model) and how to use it in scikit-learn and Spark.

### I. Background Knowledge

**Maximum-likelihood estimation** (**MLE**) is a method of estimating the parameters of a statistical model. Given a statistical model and a data set, MLE selects the parameter values that maximize the likelihood of the observed data.
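As a quick illustration, for a single Gaussian the MLE of the mean is the sample mean and the MLE of the variance is the mean squared deviation (dividing by *N*, not *N* − 1). The sketch below checks this on synthetic data; the true parameters (5.0, 2.0) are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic sample from N(5, 2^2); the true parameters are illustrative
data = rng.normal(5.0, 2.0, 10_000)

mu_hat = data.mean()                      # MLE of the mean
var_hat = ((data - mu_hat) ** 2).mean()   # MLE of the variance (divide by N)

print(mu_hat, var_hat)  # should be close to 5.0 and 4.0
```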

**Expectation–maximization** (**EM**) **algorithm** is an iterative method for finding maximum-likelihood estimates when the model depends on unobserved latent variables. Each EM iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current parameter estimates, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
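The E/M alternation above can be sketched directly in NumPy for a two-component 1-D Gaussian mixture. This is a minimal illustration, not a production implementation: the synthetic data, initial guesses, and fixed iteration count are all assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (parameters chosen for illustration)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for the means, variances, and mixing weights
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

for _ in range(100):
    # E step: responsibilities r[i, k] = P(component k | x_i) under current params
    dens = np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = w * dens
    r /= r.sum(axis=1, keepdims=True)

    # M step: re-estimate parameters to maximize the expected log-likelihood
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    w = nk / x.size

print(mu, var, w)  # means near (-2, 3), weights near (0.3, 0.7)
```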

**GMM**:

#### Some definitions and terms from wiki

A typical finite mixture model consists of:

- *N* random variables corresponding to **observations**, each assumed to be distributed according to a mixture of *K* **components**, with each component belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters.
- *N* corresponding random **latent variables** specifying the identity of the mixture component of each observation, each distributed according to a *K*-dimensional categorical distribution.

For a Gaussian component, the parameters are:

- *μ* ∈ **R** — mean (location)
- *σ*² > 0 — variance (squared scale)

Therefore, if the GMM model returns *μ* and *σ*² for each component, we can plot the mixed distribution functions.

### II. Mathematics Proof

Here are some very detailed and relatively easy-to-understand videos illustrating the GMM and EM algorithms: