scikit-learn · Commits

Commit ac62e91d
authored 13 years ago by Gael Varoquaux

DOC: reorganize GMM docs

parent ad500758
1 changed file: doc/modules/mixture.rst (+139, −128)
@@ -29,13 +29,15 @@ mixture models as generalizing k-means clustering to incorporate
 information about the covariance structure of the data as well as the
 centers of the latent Gaussians.
 
-Different Gaussian mixture models classes
-=========================================
+The `scikit-learn` implements different classes to estimate Gaussian
+mixture models, that correspond to different estimation strategies,
+detailed below.
 
 GMM classifier
----------------
+===============
 
-The :class:`GMM` object implements the expectation-maximization (EM)
+The :class:`GMM` object implements the
+:ref:`expectation-maximization <expectation_maximization>` (EM)
 algorithm for fitting mixture-of-Gaussian models. It can also draw
 confidence ellipsoids for multivariate models, and compute the
 Bayesian Information Criterion to assess the number of clusters in the
@@ -49,6 +51,10 @@ the :meth:`GMM.predict` method.
 sample belonging to the various Gaussians may be retrieved using the
 :meth:`GMM.predict_proba` method.
 
+The :class:`GMM` comes with different options to constrain the covariance
+of the different classes estimated: spherical, diagonal, tied or full
+covariance.
+
 .. figure:: ../auto_examples/mixture/images/plot_gmm_classifier_1.png
    :target: ../auto_examples/mixture/plot_gmm_classifier.html
    :align: center
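As an illustration of the :class:`GMM` usage described in the added lines above, here is a minimal sketch. Only the class, the ``fit``, ``predict`` and ``predict_proba`` methods, and the ``covariance_type`` options come from the documentation text; the constructor values and the toy data are assumptions for illustration::

    import numpy as np
    from sklearn.mixture import GMM   # historical class, later superseded by GaussianMixture

    # Toy data: two well-separated blobs.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [5, 5]])

    # covariance_type constrains the per-component covariances:
    # 'spherical', 'diag', 'tied' or 'full'.
    clf = GMM(n_components=2, covariance_type='full')
    clf.fit(X)

    hard_labels = clf.predict(X)              # most likely component for each sample
    responsibilities = clf.predict_proba(X)   # per-sample component probabilities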
@@ -62,76 +68,54 @@ the :meth:`GMM.predict` method.
     * See :ref:`example_mixture_plot_gmm_pdf.py` for an example on plotting the
       density estimation.
 
-    * See :ref:`example_mixture_plot_gmm_model selection.py` for an example
-      of model selection performed with classical GMM.
-
-VBGMM classifier: variational Gaussian mixtures
-------------------------------------------------
-
-The :class:`VBGMM` object implements a variant of the Gaussian mixture
-model with variational inference algorithms. The API is identical to
-:class:`GMM`. It is essentially a middle-ground between :class:`GMM`
-and :class:`DPGMM`, as it has some of the properties of the Dirichlet
-process.
-
-DPGMM classifier: Infinite Gaussian mixtures
----------------------------------------------
-
-The :class:`DPGMM` object implements a variant of the Gaussian mixture
-model with a variable (but bounded) number of components using the
-Dirichlet Process. The API is identical to :class:`GMM`.
-
-.. figure:: ../auto_examples/mixture/images/plot_gmm_1.png
-   :target: ../auto_examples/mixture/plot_gmm.html
+Pros and cons of class :class:`GMM`: expectation-maximization inference
+------------------------------------------------------------------------
+
+Pros
+.....
+
+:Speed: it is the fastest algorithm for learning mixture models
+
+:Agnostic: as this algorithm maximizes only the likelihood, it
+   will not bias the means towards zero, or bias the cluster sizes to
+   have specific structures that might or might not apply.
+
+Cons
+....
+
+:Singularities: when one has insufficiently many points per
+   mixture, estimating the covariance matrices becomes difficult,
+   and the algorithm is known to diverge and find solutions with
+   infinite likelihood unless one regularizes the covariances artificially.
+
+:Number of components: this algorithm will always use all the
+   components it has access to, needing held-out data
+   or information theoretical criteria to decide how many components to use
+   in the absence of external cues.
+
+Selecting the number of components in a classical GMM
+------------------------------------------------------
+
+The BIC criterion can be used to select the number of components in a GMM
+in an efficient way. In theory, it recovers the true number of components
+only in the asymptotic regime (i.e. if much data is available).
+Note that using a :ref:`DPGMM <dpgmm>` avoids the specification of the
+number of components for a Gaussian mixture model.
+
+.. figure:: ../auto_examples/mixture/images/plot_gmm_selection_1.png
+   :target: ../auto_examples/mixture/plot_gmm_selection.html
    :align: center
-   :scale: 70%
-
-The example above compares a Gaussian mixture model fitted with 5
-components on a dataset, to a DPGMM model. We can see that the DPGMM is
-able to limit itself to only 2 components. With very few observations,
-the DPGMM can take a conservative stand, and fit only one component.
+   :scale: 50%
 
 .. topic:: Examples:
 
-    * See :ref:`example_mixture_plot_gmm.py` for an example on plotting the
-      confidence ellipsoids for both :class:`GMM` and :class:`DPGMM`.
-
-.. topic:: Derivation:
-
-    * See `here <dp-derivation.html>`_ the full derivation of this
-      algorithm.
-
-.. toctree::
-    :hidden:
-
-    dp-derivation.rst
-
-Background on the inference of Gaussian mixture models
-========================================================
-
-Fitting the best mixture of Gaussians possible on a
-given dataset (as measured by the likelihood criterion) is exponential
-in the assumed number of latent Gaussian distributions. For this
-reason most of the time one uses approximate inference techniques in
-these models that, while not guaranteed to return the optimal
-solution, do converge quickly to a local optimum. To improve the
-quality it is usual to fit these models many times with different
-parameters and choose the best result, as measured by the likelihood
-or some other external criterion. Here in `scikit-learn` we implement
-two approximate inference algorithms for mixtures of Gaussians:
-expectation-maximization and variational inference. We also implement
-a variant of the mixture model, known as the Dirichlet Process prior,
-that doesn't need cross-validation procedures to choose the number of
-components, and at the expense of extra computational time the user
-only needs to specify a loose upper bound on this number and a
-concentration parameter.
-
-Expectation-maximization
-------------------------
+    * See :ref:`example_mixture_plot_gmm_selection.py` for an example
+      of model selection performed with classical GMM.
+
+.. _expectation_maximization:
+
+Estimation algorithm Expectation-maximization
+-----------------------------------------------
 
 The main difficulty in learning Gaussian mixture models from unlabeled
 data is that one usually doesn't know which points came from
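The BIC-based selection of the number of components described in the added section above can be sketched as follows. The ``bic`` method corresponds to the Bayesian Information Criterion the documentation says :class:`GMM` can compute; the candidate range and the toy data are illustrative assumptions::

    import numpy as np
    from sklearn.mixture import GMM

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [5, 5]])

    # Fit one GMM per candidate number of components and keep the lowest BIC.
    candidates = range(1, 7)
    bic_scores = [GMM(n_components=k, covariance_type='full').fit(X).bic(X)
                  for k in candidates]
    best_k = list(candidates)[int(np.argmin(bic_scores))]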
@@ -148,38 +132,47 @@ assignments. Repeating this process is guaranteed to always converge
...
@@ -148,38 +132,47 @@ assignments. Repeating this process is guaranteed to always converge
to a local optimum. In the `scikit-learn` this algorithm in
to a local optimum. In the `scikit-learn` this algorithm in
implemented in the :class:`GMM` class.
implemented in the :class:`GMM` class.
Advantages of expectation-maximization:
:Speed: it is the fastest algorithm for learning mixture models
VBGMM classifier: variational Gaussian mixtures
================================================
:Agnostic: as this algorithm maximizes only the likelihood, it
The :class:`VBGMM` object implements a variant of the Gaussian mixture
will not bias the means towards zero, or bias the cluster sizes to
model with :ref:`variational inference <variational_inference>` algorithms. The API is identical to
have specific structures that might or might not apply.
:class:`GMM`. It is essentially a middle-ground between :class:`GMM`
and :class:`DPGMM`, as it has some of the properties of the Dirichlet
process.
Disadvantages of expectation-maximization:
Pros and cons of class :class:`VBGMM`: variational inference
-------------------------------------------------------------
:Singularities: when one has insufficiently many points per
Pros
mixture, estimating the covariance matrices becomes difficult,
.....
and the algorithm is known to diverge and find solutions with
infinite likelihood unless one regularizes the covariances artificially.
:Number of components: this algorithm will always use all the
:Regularization: due to the incorporation of prior information,
components it has access to, needing held-out data
variational solutions have less pathological special cases than
or information theoretical criteria to decide how many components to use
expectation-maximization solutions. One can then use full
in the absence of external cues.
covariance matrices in high dimensions or in cases where some
components might be centered around a single point without
risking divergence.
.. figure:: ../auto_examples/mixture/images/plot_gmm_selection_1.png
Cons
:target: ../auto_examples/mixture/plot_gmm_selection.html
.....
:align: center
:scale: 50%
**Selecting the number of components in a calssical GMM:** *the BIC
:Bias: to regularize a model one has to add biases. The
criterion is an efficient procedure for that purpose, but holds
variational algorithm will bias all the means towards the origin
only in the asymptotic regime (if much data is available).*
(part of the prior information adds a "ghost point" in the origin
to every mixture component) and it will bias the covariances to
be more spherical. It will also, depending on the concentration
parameter, bias the cluster structure either towards uniformity
or towards a rich-get-richer scenario.
:Hyperparameters: this algorithm needs an extra hyperparameter
that might need experimental tuning via cross-validation.
Variational inference
.. _variational_inference:
---------------------
Estimation algorithm: variational inference
---------------------------------------------
Variational inference is an extension of expectation-maximization that
Variational inference is an extension of expectation-maximization that
maximizes a lower bound on model evidence (including
maximizes a lower bound on model evidence (including
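A corresponding sketch for the :class:`VBGMM` class described above, which the text says shares the :class:`GMM` API. The ``alpha`` argument is assumed here to be the concentration-type hyperparameter the text mentions; its name and value are illustrative, not taken from the commit::

    import numpy as np
    from sklearn.mixture import VBGMM   # historical variational Gaussian mixture class

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [5, 5]])

    # Same fit/predict API as GMM; alpha (assumed name) controls the prior concentration.
    vb = VBGMM(n_components=5, covariance_type='diag', alpha=1.0)
    vb.fit(X)
    labels = vb.predict(X)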
@@ -203,37 +196,77 @@ to some mixture components getting almost all the points while most
 mixture components will be centered on just a few of the remaining
 points.
 
-Simply switching from expectation-maximization to variational
-inference has the main following advantage:
-
-:Regularization: due to the incorporation of prior information,
-   variational solutions have less pathological special cases than
-   expectation-maximization solutions. One can then use full
-   covariance matrices in high dimensions or in cases where some
-   components might be centered around a single point without
-   risking divergence.
-
-But brings with it the following disadvantage:
-
-:Bias: to regularize a model one has to add biases. The
-   variational algorithm will bias all the means towards the origin
-   (part of the prior information adds a "ghost point" in the origin
-   to every mixture component) and it will bias the covariances to
-   be more spherical. It will also, depending on the concentration
-   parameter, bias the cluster structure either towards uniformity
-   or towards a rich-get-richer scenario.
-
-:Hyper-parameters: this algorithm needs an extra hyper-parameter
-   that might need experimental tuning via cross-validation.
+.. _dpgmm:
+
+DPGMM classifier: Infinite Gaussian mixtures
+============================================
+
+The :class:`DPGMM` object implements a variant of the Gaussian mixture
+model with a variable (but bounded) number of components using the
+Dirichlet Process. The API is identical to :class:`GMM`.
+This class doesn't require the user to choose the number of
+components, and at the expense of extra computational time the user
+only needs to specify a loose upper bound on this number and a
+concentration parameter.
+
+.. figure:: ../auto_examples/mixture/images/plot_gmm_1.png
+   :target: ../auto_examples/mixture/plot_gmm.html
+   :align: center
+   :scale: 70%
+
+The example above compares a Gaussian mixture model fitted with 5
+components on a dataset, to a DPGMM model. We can see that the DPGMM is
+able to limit itself to only 2 components. With very few observations,
+the DPGMM can take a conservative stand, and fit only one component.
+
+.. topic:: Examples:
+
+    * See :ref:`example_mixture_plot_gmm.py` for an example on plotting the
+      confidence ellipsoids for both :class:`GMM` and :class:`DPGMM`.
+
+.. topic:: Derivation:
+
+    * See `here <dp-derivation.html>`_ the full derivation of this
+      algorithm.
+
+Pros and cons of class :class:`DPGMM`: Dirichlet process mixture model
+----------------------------------------------------------------------
+
+Pros
+.....
+
+:Less sensitivity to the number of parameters: unlike finite
+   models, which will almost always use all components as much as
+   they can, and hence will produce wildly different solutions for
+   different numbers of components, the Dirichlet process solution
+   won't change much with changes to the parameters, leading to more
+   stability and less tuning.
+
+:No need to specify the number of components: only an upper bound of
+   this number needs to be provided. Note however that the DPMM is not
+   a formal model selection procedure, and thus provides no guarantee
+   on the result.
+
+Cons
+.....
+
+:Speed: the extra parametrization necessary for variational
+   inference and for the structure of the Dirichlet process can and
+   will make inference slower, although not by much.
+
+:Bias: as in variational techniques, but only more so, there are
+   many implicit biases in the Dirichlet process and the inference
+   algorithms, and whenever there is a mismatch between these biases
+   and the data it might be possible to fit better models using a
+   finite mixture.
 
 .. _dirichlet_process:
 
 The Dirichlet Process
 ---------------------
 
-Here we will talk only about using variational inference algorithms on
-Dirichlet process mixtures, for reasons of simplicity.
+Here we describe variational inference algorithms on Dirichlet process
+mixtures.
 
 One of the main advantages of variational techniques is that they can
 incorporate prior information to the model in many different ways. The
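And a sketch for the :class:`DPGMM` class described above: per the documentation text, ``n_components`` acts only as a loose upper bound on the number of components actually used. The ``alpha`` concentration argument is an assumed name, and counting occupied components via ``predict`` is an illustrative convenience, not part of the commit::

    import numpy as np
    from sklearn.mixture import DPGMM   # historical Dirichlet process mixture class

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [5, 5]])

    # n_components is only an upper bound; the model is free to leave components empty.
    dp = DPGMM(n_components=10, covariance_type='diag', alpha=1.0)
    dp.fit(X)

    # Components that actually claim at least one sample.
    occupied = np.unique(dp.predict(X))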
@@ -274,31 +307,9 @@ on the number of mixture components (this upper bound, assuming it is
...
@@ -274,31 +307,9 @@ on the number of mixture components (this upper bound, assuming it is
higher than the "true" number of components, affects only algorithmic
higher than the "true" number of components, affects only algorithmic
complexity, not the actual number of components used).
complexity, not the actual number of components used).
The advantages of using a Dirichlet process mixture model are:
.. toctree::
:hidden:
:Less sensitivity to the number of parameters: unlike finite
models, which will almost always use all components as much as
they can, and hence will produce wildly different solutions for
different numbers of components, the Dirichlet process solution
won't change much with changes to the parameters, leading to more
stability and less tuning.
:No need to specify the number of components: only an upper bound of
this number needs to be provided. Note however that the DPMM is not
a formal model selection procedure, and thus provides no guarantee
on the result.
The main disadvantages of using the Dirichlet process are:
:Speed: the extra parametrization necessary for variational
inference and for the structure of the Dirichlet process can and
will make inference slower, although not by much.
:Bias: as in variational techniques, but only more so, there are
many implicit biases in the Dirichlet process and the inference
algorithms, and whenever there is a mismatch between these biases
and the data it might be possible to fit better models using a
finite mixture.
dp-derivation.rst