From 80bfde5b2de3221f6ce688da195223ffb39da424 Mon Sep 17 00:00:00 2001
From: Olivier Grisel <olivier.grisel@ensta.org>
Date: Mon, 13 Dec 2010 16:10:47 +0100
Subject: [PATCH] cosmit (reST formatting of the SGD module documentation)

---
 doc/modules/sgd.rst | 259 +++++++++++++++++++++++---------------------
 1 file changed, 136 insertions(+), 123 deletions(-)
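
Note: the update rule documented in the "SGD" subsection of sgd.rst can be
sketched in a few lines of NumPy. This is a minimal illustration under stated
assumptions, not the Cython routine shipped in scikits.learn:
`sgd_hinge_l2` is a hypothetical helper, the learning rate `eta` is held
constant instead of being lowered after each example, and samples are visited
in the order given, so they should be shuffled beforehand.

```python
import numpy as np


def sgd_hinge_l2(X, y, alpha=0.01, eta=0.1, n_iter=20):
    """Plain SGD for a linear classifier with hinge loss and L2 penalty.

    Hypothetical helper for illustration only. Labels in y must be -1 or +1.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        for i in range(n_samples):
            margin = y[i] * (X[i] @ w + b)
            # Subgradient of the hinge loss max(0, 1 - y * f(x)) with
            # respect to f(x): non-zero only if the margin is violated.
            dloss = -y[i] if margin < 1.0 else 0.0
            # R(w) = 1/2 * sum(w_j ** 2), hence dR/dw = w.
            w -= eta * (alpha * w + dloss * X[i])
            # The intercept is updated without regularization.
            b -= eta * dloss
    return w, b


# Tiny linearly separable toy problem (assumed data, already interleaved).
X = np.array([[2., 1.], [-1., -2.], [1., 2.], [-2., -1.]])
y = np.array([1., -1., 1., -1.])

w, b = sgd_hinge_l2(X, y)
pred = np.sign(X @ w + b)
print(w, b, pred)
```

The hinge term only contributes when an example violates the margin, which
is the "lazy" behaviour the documentation describes; the L2 shrinkage is
still applied at every step in this sketch.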

diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst
index 2e202eb4ab..d6f21ae9ea 100644
--- a/doc/modules/sgd.rst
+++ b/doc/modules/sgd.rst
@@ -7,50 +7,52 @@ Stochastic Gradient Descent
 
 .. currentmodule:: scikits.learn.linear_model
 
-**Stochastic Gradient Descent (SGD)** is a simple yet very efficient approach 
-to discriminative learning of linear classifiers under convex loss functions 
-such as (linear) `Support Vector Machines <http://en.wikipedia.org/wiki/Support_vector_machine>`_ and `Logistic Regression <http://en.wikipedia.org/wiki/Logistic_regression>`_. 
-Even though SGD has been around in the machine learning community for a long time, 
-it has received a considerable amount of attention just recently in the 
-context of large-scale learning. 
-
-SGD has been successfully applied to large-scale and sparse machine learning 
-problems often encountered in text classification and natural language 
-processing. 
-Given that the data is sparse, the classifiers in this module easily scale 
-to problems with more than 10^5 training examples and more than 10^4 features. 
+**Stochastic Gradient Descent (SGD)** is a simple yet very efficient
+approach to discriminative learning of linear classifiers under
+convex loss functions such as (linear) `Support Vector Machines
+<http://en.wikipedia.org/wiki/Support_vector_machine>`_ and `Logistic
+Regression <http://en.wikipedia.org/wiki/Logistic_regression>`_.
+Even though SGD has been around in the machine learning community for
+a long time, it has received a considerable amount of attention just
+recently in the context of large-scale learning.
+
+SGD has been successfully applied to large-scale and sparse machine
+learning problems often encountered in text classification and natural
+language processing.  Given that the data is sparse, the classifiers
+in this module easily scale to problems with more than 10^5 training
+examples and more than 10^4 features.
 
 The advantages of Stochastic Gradient Descent are:
 
     - Efficiency.
 
-    - Ease of implementation (lots of opportunities for code tuning). 
+    - Ease of implementation (lots of opportunities for code tuning).
 
 The disadvantages of Stochastic Gradient Descent include:
 
     - SGD requires a number of hyperparameters including the regularization
-      parameter and the number of iterations. 
+      parameter and the number of iterations.
 
-    - SGD is sensitive to feature scaling. 
+    - SGD is sensitive to feature scaling.
 
 Classification
 ==============
 
-.. warning:: Make sure you permute (shuffle) your training data before fitting the model or use `shuffle=True` to shuffle after each iterations. 
+.. warning:: Make sure you permute (shuffle) your training data before fitting the model, or use `shuffle=True` to shuffle after each iteration.
 
-The class :class:`SGDClassifier` implements a plain stochastic gradient descent 
-learning routine which supports different loss functions and penalties for 
-classification.
+The class :class:`SGDClassifier` implements a plain stochastic gradient
+descent learning routine which supports different loss functions and
+penalties for classification.
 
 .. figure:: ../auto_examples/linear_model/images/plot_sgd_separating_hyperplane.png
    :target: ../auto_examples/linear_model/plot_sgd_separating_hyperplane.html
    :align: center
    :scale: 75
 
-As other classifiers, SGD has to be fitted with two arrays:
-an array X of size [n_samples, n_features] holding the training
-samples, and an array Y of size [n_samples] holding the target values
-(class labels) for the training samples::
+Like other classifiers, SGD has to be fitted with two arrays: an array X
+of size [n_samples, n_features] holding the training samples, and an
+array Y of size [n_samples] holding the target values (class labels)
+for the training samples::
 
     >>> from scikits.learn.linear_model import SGDClassifier
     >>> X = [[0., 0.], [1., 1.]]
@@ -65,8 +67,8 @@ After being fitted, the model can then be used to predict new values::
     >>> clf.predict([[2., 2.]])
     array([ 1.])
 
-SGD fits a linear model to the training data. The member `coef_` holds the 
-model parameters:
+SGD fits a linear model to the training data. The member `coef_` holds
+the model parameters:
 
     >>> clf.coef_
     array([ 9.90090187,  9.90090187])
@@ -76,61 +78,65 @@ Member `intercept_` holds the intercept (aka offset or bias):
     >>> clf.intercept_
     array(-9.9900299301496904)
 
-Whether or not the model should use an intercept, i.e. a biased hyperplane, is 
-controlled by the parameter `fit_intercept`.
+Whether or not the model should use an intercept, i.e. a biased
+hyperplane, is controlled by the parameter `fit_intercept`.
 
 To get the signed distance to the hyperplane use `decision_function`:
 
     >>> clf.decision_function([[2., 2.]])
     array([ 29.61357756])
 
-The concrete loss function can be set via the `loss` parameter. :class:`SGDClassifier` supports the
-following loss functions: 
+The concrete loss function can be set via the `loss`
+parameter. :class:`SGDClassifier` supports the following loss functions:
 
   - `loss="hinge"`: (soft-margin) linear Support Vector Machine.
-  - `loss="modified_huber"`: smoothed hinge loss. 
+  - `loss="modified_huber"`: smoothed hinge loss.
   - `loss="log"`: Logistic Regression
 
-The first two loss functions are lazy, they only update the model parameters if 
-an example violates the margin constraint, which makes training very efficient. 
-Log loss, on the other hand, provides probability estimates.
+The first two loss functions are lazy: they only update the model
+parameters if an example violates the margin constraint, which makes
+training very efficient.  Log loss, on the other hand, provides
+probability estimates.
 
-In the case of binary classification and `loss="log"` you get a probability 
-estimate P(y=C|x) using `predict_proba`, where `C` is the largest class label: 
-   
-    >>> clf = SGDClassifier(loss="log").fit(X, y)
-    >>> clf.predict_proba([[1., 1.]])
-    array([ 0.99999949])
+In the case of binary classification and `loss="log"` you get a
+probability estimate P(y=C|x) using `predict_proba`, where `C` is the
+largest class label:
 
-The concrete penalty can be set via the `penalty` parameter. `SGD` supports the
-following penalties: 
+    >>> clf = SGDClassifier(loss="log").fit(X, y)
+    >>> clf.predict_proba([[1., 1.]])
+    array([ 0.99999949])
+
+The concrete penalty can be set via the `penalty` parameter. `SGD`
+supports the following penalties:
 
   - `penalty="l2"`: L2 norm penalty on `coef_`.
   - `penalty="l1"`: L1 norm penalty on `coef_`.
-  - `penalty="elasticnet"`: Convex combination of L2 and L1; `rho * L2 + (1 - rho) * L1`. 
-
-The default setting is `penalty="l2"`. The L1 penalty leads to sparse solutions, 
-driving most coefficients to zero. The Elastic Net solves some deficiencies of 
-the L1 penalty in the presence of highly correlated attributes. The parameter `rho`
-has to be specified by the user. 
-
-:class:`SGDClassifier` supports multi-class classification by combining multiple 
-binary classifiers in a "one versus all" (OVA) scheme. For each of the `K` classes, 
-a binary classifier is learned that discriminates between that and all other `K-1`
-classes. At testing time, we compute the confidence score (i.e. the signed distances 
-to the hyperplane) for each classifier and choose the class with the highest 
-confidence. The Figure below illustrates the OVA approach on the iris dataset. 
-The dashed lines represent the three OVA classifiers; 
-the background colors show the decision surface induced by the three classifiers. 
+  - `penalty="elasticnet"`: Convex combination of L2 and L1; `rho * L2 + (1 - rho) * L1`.
+
+The default setting is `penalty="l2"`. The L1 penalty leads to sparse
+solutions, driving most coefficients to zero. The Elastic Net solves
+some deficiencies of the L1 penalty in the presence of highly correlated
+attributes. The parameter `rho` has to be specified by the user.
+
+:class:`SGDClassifier` supports multi-class classification by combining
+multiple binary classifiers in a "one versus all" (OVA) scheme. For each
+of the `K` classes, a binary classifier is learned that discriminates
+between that and all other `K-1` classes. At testing time, we compute the
+confidence score (i.e. the signed distance to the hyperplane) for each
+classifier and choose the class with the highest confidence. The Figure
+below illustrates the OVA approach on the iris dataset.  The dashed
+lines represent the three OVA classifiers; the background colors show
+the decision surface induced by the three classifiers.
 
 .. figure:: ../auto_examples/linear_model/images/plot_sgd_iris.png
    :target: ../auto_examples/linear_model/plot_sgd_iris.html
    :align: center
    :scale: 75
 
-In the case of multi-class classification `coef_` is a two-dimensionaly array of shape
-[n_classes, n_features] and `intercept_` is a one dimensional array of shape [n_classes]. The i-th row of `coef_` holds the weight vector of the OVA classifier for the i-th 
-class; classes are indexed in ascending order (see member `classes`). 
+In the case of multi-class classification `coef_` is a two-dimensional
+array of shape [n_classes, n_features] and `intercept_` is a
+one-dimensional array of shape [n_classes]. The i-th row of `coef_` holds
+the weight vector of the OVA classifier for the i-th class; classes are
+indexed in ascending order (see member `classes`).
 
 .. topic:: Examples:
 
@@ -140,17 +146,17 @@ class; classes are indexed in ascending order (see member `classes`).
 Regression
 ==========
 
-The class :class:`SGDRegressor` implements a plain stochastic gradient descent learning 
-routine which supports different loss functions and penalties to fit linear regression
-models. 
+The class :class:`SGDRegressor` implements a plain stochastic gradient
+descent learning routine which supports different loss functions and
+penalties to fit linear regression models.
 
 .. figure:: ../auto_examples/linear_model/images/plot_sgd_ols.png
    :target: ../auto_examples/linear_model/plot_sgd_ols.html
    :align: center
    :scale: 75
 
-The concrete loss function can be set via the `loss` parameter. :class:`SGDRegressor` supports the
-following loss functions: 
+The concrete loss function can be set via the `loss`
+parameter. :class:`SGDRegressor` supports the following loss functions:
 
   - `loss="squared_loss"`: Ordinary least squares.
   - `loss="huber"`: Huber loss for robust regression.
@@ -165,7 +171,9 @@ following loss functions:
 Stochastic Gradient Descent for sparse data
 ===========================================
 
-.. note:: The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.
+.. note:: The sparse implementation produces slightly different results
+  than the dense implementation due to a shrunk learning rate for the
+  intercept.
 
 There is support for sparse data given in any matrix in a format
 supported by scipy.sparse. Classes have the same name, just prefixed
@@ -186,14 +194,14 @@ Implemented classes are :class:`SGDClassifier` and :class:`SGDRegressor`.
 Complexity
 ==========
 
-The major advantage of SGD is its efficiency, which is basically 
-linear in the number of training examples. If X is a matrix of size (n, p) 
-training has a cost of :math:`O(k n \bar p)`, where k is the number 
-of iterations (epochs) and :math:`\bar p` is the average number of 
-non-zero attributes per sample. 
+The major advantage of SGD is its efficiency, which is basically
+linear in the number of training examples. If X is a matrix of size (n, p)
+training has a cost of :math:`O(k n \bar p)`, where k is the number
+of iterations (epochs) and :math:`\bar p` is the average number of
+non-zero attributes per sample.
 
-Recent theoretical results, however, show that the runtime to get some 
-desired optimization accuracy does not increase as the training set size increases. 
+Recent theoretical results, however, show that the runtime to get some
+desired optimization accuracy does not increase as the training set size increases.
 
 Tips on Practical Use
 =====================
@@ -206,51 +214,52 @@ Tips on Practical Use
     results. See `The CookBook
     <https://sourceforge.net/apps/trac/scikit-learn/wiki/CookBook>`_
     for some examples on scaling. If your attributes have an intrinsic
-    scale (e.g. word frequencies or indicator features) scaling is 
-    not needed. 
+    scale (e.g. word frequencies or indicator features) scaling is
+    not needed.
 
-  * Finding a reasonable regularization term :math:`\alpha` is 
+  * Finding a reasonable regularization term :math:`\alpha` is
     best done using grid search `for alpha in 10.0**-np.arange(1,7)`.
 
-  * Empirically, we found that SGD converges after observing 
-    approx. 10^6 training samples. Thus, a reasonable first guess 
-    for the number of iterations is `n_iter = np.ceil(10**6 / n)`, 
+  * Empirically, we found that SGD converges after observing
+    approx. 10^6 training samples. Thus, a reasonable first guess
+    for the number of iterations is `n_iter = np.ceil(10**6 / n)`,
     where `n` is the size of the training set.
 
 .. topic:: References:
 
- * `"Efficient BackProp" <yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf>`_ 
-   Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.
+ * `"Efficient BackProp" <http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf>`_
+   Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks
+   of the Trade 1998.
 
 .. _sgd_mathematical_formulation:
 
 Mathematical formulation
 ========================
 
-Given a set of training examples :math:`(x_1, y_1), \ldots, (x_n, y_n)` where 
-:math:`x_i \in \mathbf{R}^n` and :math:`y_i \in \{-1,1\}`, our goal is to 
+Given a set of training examples :math:`(x_1, y_1), \ldots, (x_n, y_n)` where
+:math:`x_i \in \mathbf{R}^m` and :math:`y_i \in \{-1,1\}`, our goal is to
 learn a linear scoring function :math:`f(x) = w^T x + b` with model parameters
 :math:`w \in \mathbf{R}^m` and intercept :math:`b \in \mathbf{R}`. In order
 to make predictions, we simply look at the sign of :math:`f(x)`.
-A common choice to find the model parameters is by minimizing the regularized 
+A common choice to find the model parameters is by minimizing the regularized
 training error given by
 
 .. math::
 
     E(w,b) = \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)
 
-where :math:`L` is a loss function that measures model (mis)fit and :math:`R` is a
-regularization term (aka penalty) that penalizes model complexity; :math:`\alpha > 0`
-is a non-negative hyperparameter. 
+where :math:`L` is a loss function that measures model (mis)fit and
+:math:`R` is a regularization term (aka penalty) that penalizes model
+complexity; :math:`\alpha > 0` is a positive hyperparameter.
 
-Different choices for :math:`L` entail different classifiers such as 
+Different choices for :math:`L` entail different classifiers such as
 
    - Hinge: (soft-margin) Support Vector Machines.
    - Log:   Logistic Regression.
-   - Least-Squares: Ridge Regression. 
+   - Least-Squares: Ridge Regression.
 
-All of the above loss functions can be regarded as an upper bound on the 
-misclassification error (Zero-one loss) as shown in the Figure below. 
+All of the above loss functions can be regarded as an upper bound on the
+misclassification error (Zero-one loss) as shown in the Figure below.
 
 .. figure:: ../auto_examples/linear_model/images/plot_sgd_loss_functions.png
    :align: center
@@ -258,12 +267,13 @@ misclassification error (Zero-one loss) as shown in the Figure below.
 
 Popular choices for the regularization term :math:`R` include:
 
-   - L2 norm: :math:`R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2`, 
-   - L1 norm: :math:`R(w) := \sum_{i=1}^{n} |w_i|`, which leadsin sparse solutions.
-   - Elastic Net: :math:`R(w) := \rho \frac{1}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i|`, a convex combination of L2 and L1. 
+   - L2 norm: :math:`R(w) := \frac{1}{2} \sum_{i=1}^{m} w_i^2`,
+   - L1 norm: :math:`R(w) := \sum_{i=1}^{m} |w_i|`, which leads to sparse
+     solutions.
+   - Elastic Net: :math:`R(w) := \rho \frac{1}{2} \sum_{i=1}^{m} w_i^2 + (1-\rho) \sum_{i=1}^{m} |w_i|`, a convex combination of L2 and L1.
 
-The Figure below shows the contours of the different regularization terms 
-in the parameter space when :math:`R(w) = 1`. 
+The Figure below shows the contours of the different regularization terms
+in the parameter space when :math:`R(w) = 1`.
 
 .. figure:: ../auto_examples/linear_model/images/plot_sgd_penalties.png
    :align: center
@@ -272,25 +282,26 @@ in the parameter space when :math:`R(w) = 1`.
 SGD
 ---
 
-Stochastic gradient descent is an optimization method for unconstrained 
+Stochastic gradient descent is an optimization method for unconstrained
 optimization problems. In contrast to (batch) gradient descent, SGD
-approximates the true gradient of :math:`E(w,b)` by considering a 
-single training example at a time. 
+approximates the true gradient of :math:`E(w,b)` by considering a
+single training example at a time.
 
-The class :class:`SGDClassifier` implements a first-order SGD learning routine. 
-The algorithm iterates over the training examples and for each example 
-updates the model parameters according to the update rule given by
+The class :class:`SGDClassifier` implements a first-order SGD learning
+routine.  The algorithm iterates over the training examples and for each
+example updates the model parameters according to the update rule given by
 
 .. math::
 
-    w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w} 
+    w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}
     + \frac{\partial L(w^T x_i + b, y_i)}{\partial w})
 
-where :math:`\eta` is the learning rate which controls the step-size 
-in the parameter space. 
-The intercept :math:`b` is updated similarly but without regularization.
+where :math:`\eta` is the learning rate which controls the step-size in
+the parameter space.  The intercept :math:`b` is updated similarly but
+without regularization.
 
-The model parameters can be accessed through the members coef\_ and intercept\_:
+The model parameters can be accessed through the members coef\_ and
+intercept\_:
 
      - Member coef\_ holds the weights :math:`w`
 
@@ -298,37 +309,39 @@ The model parameters can be accessed through the members coef\_ and intercept\_:
 
 .. topic:: References:
 
- * `"Solving large scale linear prediction problems using stochastic gradient descent algorithms"
-   <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377>`_ 
+ * `"Solving large scale linear prediction problems using stochastic
+   gradient descent algorithms"
+   <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377>`_
    T. Zhang - In Proceedings of ICML '04.
-   
+
  * `"Regularization and variable selection via the elastic net"
-   <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696>`_ 
-   H. Zou, T. Hastie - Journal of the Royal Statistical Society Series B, 67 (2), 301-320.
+   <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696>`_
+   H. Zou, T. Hastie - Journal of the Royal Statistical Society Series B,
+   67 (2), 301-320.
 
 
 Implementation details
 ======================
 
-The implementation of SGD is influenced by the `Stochastic Gradient SVM 
-<http://leon.bottou.org/projects/sgd>`_  of Léon Bottou. Similar to SvmSGD, 
-the weight vector is represented as the product of a scalar and a vector 
-which allows an efficient weight update in the case of L2 regularization. 
-In the case of sparse feature vectors, the intercept is updated with a 
-smaller learning rate (multiplied by 0.01) to account for the fact that 
+The implementation of SGD is influenced by the `Stochastic Gradient SVM
+<http://leon.bottou.org/projects/sgd>`_  of Léon Bottou. Similar to SvmSGD,
+the weight vector is represented as the product of a scalar and a vector
+which allows an efficient weight update in the case of L2 regularization.
+In the case of sparse feature vectors, the intercept is updated with a
+smaller learning rate (multiplied by 0.01) to account for the fact that
 it is updated more frequently. Training examples are picked up sequentially
-and the learning rate is lowered after each observed example. We adopted the 
-learning rate schedule from Shalev-Shwartz et al. 2007. 
-For multi-class classification, a "one versus all" approach is used. 
-We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009 
-for L1 regularization (and the Elastic Net). 
+and the learning rate is lowered after each observed example. We adopted the
+learning rate schedule from Shalev-Shwartz et al. 2007.
+For multi-class classification, a "one versus all" approach is used.
+We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009
+for L1 regularization (and the Elastic Net).
 The code is written in Cython.
 
 .. topic:: References:
 
  * `"Stochastic Gradient Descent" <http://leon.bottou.org/projects/sgd>`_ L. Bottou - Website, 2010.
 
- * `"Pegasos: Primal estimated sub-gradient solver for svm" 
+ * `"Pegasos: Primal estimated sub-gradient solver for svm"
    <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.8513>`_
    S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07.
 
-- 
GitLab