Commit ac62e91d authored by Gael Varoquaux

DOC: reorganize GMM docs

parent ad500758

mixture models as generalizing k-means clustering to incorporate
information about the covariance structure of the data as well as the
centers of the latent Gaussians.

`scikit-learn` implements different classes to estimate Gaussian
mixture models, corresponding to different estimation strategies,
as detailed below.

GMM classifier
==============

The :class:`GMM` object implements the
:ref:`expectation-maximization <expectation_maximization>` (EM)
algorithm for fitting mixture-of-Gaussian models. It can also draw
confidence ellipsoids for multivariate models, and compute the
Bayesian Information Criterion to assess the number of clusters in the
data.

The probability of each
sample belonging to the various Gaussians may be retrieved using the
:meth:`GMM.predict_proba` method.

The :class:`GMM` comes with different options to constrain the covariance
of the different classes estimated: spherical, diagonal, tied or full
covariance.

.. figure:: ../auto_examples/mixture/images/plot_gmm_classifier_1.png
   :target: ../auto_examples/mixture/plot_gmm_classifier.html
   :align: center

* See :ref:`example_mixture_plot_gmm_pdf.py` for an example on plotting the
  density estimation.
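
The workflow described above can be exercised with a few lines of code.
The snippet below is only a sketch: the toy data is made up for
illustration, and it assumes the :class:`GMM` constructor accepts the
``n_components`` and ``covariance_type`` arguments discussed in this
section::

    import numpy as np
    from sklearn import mixture

    np.random.seed(1)
    # Toy data: two well-separated blobs in two dimensions
    X = np.r_[np.random.randn(100, 2), np.random.randn(100, 2) + [6, 6]]

    clf = mixture.GMM(n_components=2, covariance_type='full')
    clf.fit(X)                     # EM estimation of weights, means, covariances
    labels = clf.predict(X)        # hard assignment of each sample to a component
    proba = clf.predict_proba(X)   # posterior probability of each component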

Pros and cons of class :class:`GMM`: expectation-maximization inference
--------------------------------------------------------------------------

Pros
....

:Speed: it is the fastest algorithm for learning mixture models.

:Agnostic: as this algorithm maximizes only the likelihood, it
   will not bias the means towards zero, or bias the cluster sizes to
   have specific structures that might or might not apply.

Cons
....

:Singularities: when one has insufficiently many points per
   mixture, estimating the covariance matrices becomes difficult,
   and the algorithm is known to diverge and find solutions with
   infinite likelihood unless one regularizes the covariances artificially.

:Number of components: this algorithm will always use all the
   components it has access to, needing held-out data
   or information-theoretic criteria to decide how many components to use
   in the absence of external cues.

Selecting the number of components in a classical GMM
------------------------------------------------------

The BIC criterion can be used to select the number of components in a GMM
in an efficient way. In theory, it recovers the true number of components
only in the asymptotic regime (i.e. if much data is available).
Note that using a :ref:`DPGMM <dpgmm>` avoids having to specify the
number of components for a Gaussian mixture model.

.. figure:: ../auto_examples/mixture/images/plot_gmm_selection_1.png
   :target: ../auto_examples/mixture/plot_gmm_selection.html
   :align: center
   :scale: 50%

.. topic:: Examples:

   * See :ref:`example_mixture_plot_gmm_selection.py` for an example
     of model selection performed with classical GMM.
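
The model selection in the example above can be sketched in a few lines.
This is an illustration only: the data is synthetic, and it assumes the
BIC value mentioned earlier is exposed by the fitted model as a ``bic``
method::

    import numpy as np
    from sklearn import mixture

    np.random.seed(0)
    # Synthetic data drawn from two components
    X = np.r_[np.random.randn(200, 2), np.random.randn(200, 2) + [5, 5]]

    best_gmm, best_bic = None, np.inf
    for n_components in range(1, 7):
        gmm = mixture.GMM(n_components=n_components, covariance_type='full')
        gmm.fit(X)
        bic = gmm.bic(X)             # lower BIC indicates a better trade-off
        if bic < best_bic:
            best_bic, best_gmm = bic, gmm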

.. _expectation_maximization:

Estimation algorithm: expectation-maximization
-----------------------------------------------

The main difficulty in learning Gaussian mixture models from unlabeled
data is that one usually doesn't know which points came from

assignments. Repeating this process is guaranteed to always converge
to a local optimum. In the `scikit-learn` this algorithm is
implemented in the :class:`GMM` class.
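
The alternation sketched above (compute soft assignments given the current
parameters, then re-estimate the parameters from those assignments) can be
written out explicitly. The following is a self-contained toy
implementation of EM for a one-dimensional mixture, meant only to
illustrate the idea; it is not the :class:`GMM` implementation::

    import numpy as np

    def em_gmm_1d(x, n_components, n_iter=100):
        """Toy EM for a 1-D Gaussian mixture (illustration only)."""
        rng = np.random.RandomState(0)
        weights = np.ones(n_components) / n_components
        means = x[rng.permutation(len(x))[:n_components]]   # random initial centers
        variances = np.ones(n_components) * np.var(x)
        for _ in range(n_iter):
            # E-step: responsibility of each component for each point
            dens = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                    / np.sqrt(2 * np.pi * variances))
            resp = weights * dens
            resp /= resp.sum(axis=1)[:, None]
            # M-step: re-estimate weights, means and variances
            n_k = resp.sum(axis=0)
            weights = n_k / len(x)
            means = (resp * x[:, None]).sum(axis=0) / n_k
            variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k
        return weights, means, variances

    # Example: recover two components at roughly 0 and 5
    x = np.r_[np.random.randn(200), 5 + np.random.randn(200)]
    print(em_gmm_1d(x, n_components=2))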

VBGMM classifier: variational Gaussian mixtures
================================================

The :class:`VBGMM` object implements a variant of the Gaussian mixture
model with :ref:`variational inference <variational_inference>` algorithms.
The API is identical to :class:`GMM`. It is essentially a middle ground
between :class:`GMM` and :class:`DPGMM`, as it has some of the properties
of the Dirichlet process.
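
Because the API matches :class:`GMM`, switching to variational inference
is essentially a drop-in change. The sketch below assumes the extra
concentration hyperparameter (discussed in the cons below) is exposed as
``alpha``::

    import numpy as np
    from sklearn import mixture

    np.random.seed(0)
    X = np.r_[np.random.randn(100, 2), np.random.randn(100, 2) + [4, 4]]

    # Same fit/predict calls as with GMM; alpha controls the strength of the prior
    clf = mixture.VBGMM(n_components=5, covariance_type='diag', alpha=1.0)
    clf.fit(X)
    labels = clf.predict(X)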

Pros and cons of class :class:`VBGMM`: variational inference
--------------------------------------------------------------

Pros
....

:Regularization: due to the incorporation of prior information,
   variational solutions have less pathological special cases than
   expectation-maximization solutions. One can then use full
   covariance matrices in high dimensions or in cases where some
   components might be centered around a single point without
   risking divergence.

Cons
....

:Bias: to regularize a model one has to add biases. The
   variational algorithm will bias all the means towards the origin
   (part of the prior information adds a "ghost point" in the origin
   to every mixture component) and it will bias the covariances to
   be more spherical. It will also, depending on the concentration
   parameter, bias the cluster structure either towards uniformity
   or towards a rich-get-richer scenario.

:Hyperparameters: this algorithm needs an extra hyperparameter
   that might need experimental tuning via cross-validation.

.. _variational_inference:

Estimation algorithm: variational inference
---------------------------------------------

Variational inference is an extension of expectation-maximization that
maximizes a lower bound on model evidence (including

to some mixture components getting almost all the points while most
mixture components will be centered on just a few of the remaining
points.

.. _dpgmm:

DPGMM classifier: Infinite Gaussian mixtures
=============================================

The :class:`DPGMM` object implements a variant of the Gaussian mixture
model with a variable (but bounded) number of components using the
Dirichlet Process. The API is identical to :class:`GMM`.
This class doesn't require the user to choose the number of
components: at the expense of extra computational time, the user
only needs to specify a loose upper bound on this number and a
concentration parameter.
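
A usage sketch, under the assumption that ``n_components`` plays the role
of the loose upper bound, ``alpha`` the role of the concentration
parameter mentioned above, and that the fitted mixture exposes its mixing
weights as ``weights_``::

    import numpy as np
    from sklearn import mixture

    np.random.seed(0)
    # Data from two clusters, but we only provide an upper bound of 5 components
    X = np.r_[np.random.randn(100, 2), np.random.randn(100, 2) + [4, 4]]

    dpgmm = mixture.DPGMM(n_components=5, alpha=1.0, covariance_type='diag')
    dpgmm.fit(X)
    labels = dpgmm.predict(X)
    # Components that end up unused receive a negligible share of the weight
    print(np.round(dpgmm.weights_, 2))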

.. figure:: ../auto_examples/mixture/images/plot_gmm_1.png
   :target: ../auto_examples/mixture/plot_gmm.html
   :align: center
   :scale: 70%

The example above compares a Gaussian mixture model fitted with 5
components on a dataset to a DPGMM model. We can see that the DPGMM is
able to limit itself to only 2 components. With very few observations,
the DPGMM can take a conservative stand and fit only one component.

.. topic:: Examples:

   * See :ref:`example_mixture_plot_gmm.py` for an example on plotting the
     confidence ellipsoids for both :class:`GMM` and :class:`DPGMM`.

.. topic:: Derivation:

   * See `here <dp-derivation.html>`_ for the full derivation of this
     algorithm.

Pros and cons of class :class:`DPGMM`: Dirichlet process mixture model
------------------------------------------------------------------------

Pros
....

:Less sensitivity to the number of parameters: unlike finite
   models, which will almost always use all components as much as
   they can, and hence will produce wildly different solutions for
   different numbers of components, the Dirichlet process solution
   won't change much with changes to the parameters, leading to more
   stability and less tuning.

:No need to specify the number of components: only an upper bound on
   this number needs to be provided. Note however that the DPMM is not
   a formal model selection procedure, and thus provides no guarantee
   on the result.

Cons
....

:Speed: the extra parametrization necessary for variational
   inference and for the structure of the Dirichlet process can and
   will make inference slower, although not by much.

:Bias: as in variational techniques, but only more so, there are
   many implicit biases in the Dirichlet process and the inference
   algorithms, and whenever there is a mismatch between these biases
   and the data it might be possible to fit better models using a
   finite mixture.

.. _dirichlet_process:

The Dirichlet Process
---------------------

Here we describe variational inference algorithms on Dirichlet process
mixtures.

One of the main advantages of variational techniques is that they can
incorporate prior information into the model in many different ways. The

on the number of mixture components (this upper bound, assuming it is
higher than the "true" number of components, affects only algorithmic
complexity, not the actual number of components used).
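
The role of the concentration parameter can be illustrated independently
of scikit-learn with the standard stick-breaking construction of the
Dirichlet process: each component weight is a fraction broken off the
remaining "stick" of probability mass. This is a self-contained sketch,
not library code::

    import numpy as np

    def stick_breaking_weights(alpha, n_components, rng):
        """Truncated stick-breaking draw of mixture weights."""
        # Each Beta(1, alpha) draw is the fraction broken off the remaining stick
        fractions = rng.beta(1.0, alpha, size=n_components)
        remaining = np.concatenate([[1.0], np.cumprod(1.0 - fractions[:-1])])
        return fractions * remaining

    rng = np.random.RandomState(0)
    for alpha in (0.1, 1.0, 10.0):
        weights = stick_breaking_weights(alpha, n_components=10, rng=rng)
        # Small alpha concentrates mass on few components; large alpha spreads it
        print(alpha, np.round(weights, 2))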

.. toctree::
   :hidden:

   dp-derivation.rst