From 80bfde5b2de3221f6ce688da195223ffb39da424 Mon Sep 17 00:00:00 2001
From: Olivier Grisel <olivier.grisel@ensta.org>
Date: Mon, 13 Dec 2010 16:10:47 +0100
Subject: [PATCH] cosmit (reST formatting of the SGD module documentation)

---
 doc/modules/sgd.rst | 259 +++++++++++++++++++++++---------------------
 1 file changed, 136 insertions(+), 123 deletions(-)

diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst
index 2e202eb4ab..d6f21ae9ea 100644
--- a/doc/modules/sgd.rst
+++ b/doc/modules/sgd.rst
@@ -7,50 +7,52 @@ Stochastic Gradient Descent

 .. currentmodule:: scikits.learn.linear_model

-**Stochastic Gradient Descent (SGD)** is a simple yet very efficient approach
-to discriminative learning of linear classifiers under convex loss functions
-such as (linear) `Support Vector Machines <http://en.wikipedia.org/wiki/Support_vector_machine>`_ and `Logistic Regression <http://en.wikipedia.org/wiki/Logistic_regression>`_.
-Even though SGD has been around in the machine learning community for a long time,
-it has received a considerable amount of attention just recently in the
-context of large-scale learning.
-
-SGD has been successfully applied to large-scale and sparse machine learning
-problems often encountered in text classification and natural language
-processing.
-Given that the data is sparse, the classifiers in this module easily scale
-to problems with more than 10^5 training examples and more than 10^4 features.
+**Stochastic Gradient Descent (SGD)** is a simple yet very efficient
+approach to discriminative learning of linear classifiers under
+convex loss functions such as (linear) `Support Vector Machines
+<http://en.wikipedia.org/wiki/Support_vector_machine>`_ and `Logistic
+Regression <http://en.wikipedia.org/wiki/Logistic_regression>`_.
+Even though SGD has been around in the machine learning community for
+a long time, it has received a considerable amount of attention just
+recently in the context of large-scale learning.
+
+SGD has been successfully applied to large-scale and sparse machine
+learning problems often encountered in text classification and natural
+language processing. Given that the data is sparse, the classifiers
+in this module easily scale to problems with more than 10^5 training
+examples and more than 10^4 features.

 The advantages of Stochastic Gradient Descent are:

     - Efficiency.

-    - Ease of implementation (lots of opportunities for code tuning).
+    - Ease of implementation (lots of opportunities for code tuning).

 The disadvantages of Stochastic Gradient Descent include:

     - SGD requires a number of hyperparameters including the regularization
-      parameter and the number of iterations.
+      parameter and the number of iterations.

-    - SGD is sensitive to feature scaling.
+    - SGD is sensitive to feature scaling.

 Classification
 ==============

-.. warning:: Make sure you permute (shuffle) your training data before fitting the model or use `shuffle=True` to shuffle after each iterations.
+.. warning:: Make sure you permute (shuffle) your training data before fitting the model or use `shuffle=True` to shuffle after each iterations.

-The class :class:`SGDClassifier` implements a plain stochastic gradient descent
-learning routine which supports different loss functions and penalties for
-classification.
+The class :class:`SGDClassifier` implements a plain stochastic gradient
+descent learning routine which supports different loss functions and
+penalties for classification.

 .. figure:: ../auto_examples/linear_model/images/plot_sgd_separating_hyperplane.png
    :target: ../auto_examples/linear_model/plot_sgd_separating_hyperplane.html
    :align: center
    :scale: 75

-As other classifiers, SGD has to be fitted with two arrays:
-an array X of size [n_samples, n_features] holding the training
-samples, and an array Y of size [n_samples] holding the target values
-(class labels) for the training samples::
+As other classifiers, SGD has to be fitted with two arrays: an array X
+of size [n_samples, n_features] holding the training samples, and an
+array Y of size [n_samples] holding the target values (class labels)
+for the training samples::

     >>> from scikits.learn.linear_model import SGDClassifier
     >>> X = [[0., 0.], [1., 1.]]
@@ -65,8 +67,8 @@ After being fitted, the model can then be used to predict new values::
     >>> clf.predict([[2., 2.]])
     array([ 1.])

-SGD fits a linear model to the training data. The member `coef_` holds the
-model parameters:
+SGD fits a linear model to the training data. The member `coef_` holds
+the model parameters:

     >>> clf.coef_
     array([ 9.90090187, 9.90090187])
@@ -76,61 +78,65 @@ Member `intercept_` holds the intercept (aka offset or bias):
     >>> clf.intercept_
     array(-9.9900299301496904)

-Whether or not the model should use an intercept, i.e. a biased hyperplane, is
-controlled by the parameter `fit_intercept`.
+Whether or not the model should use an intercept, i.e. a biased
+hyperplane, is controlled by the parameter `fit_intercept`.

 To get the signed distance to the hyperplane use `decision_function`:

     >>> clf.decision_function([[2., 2.]])
     array([ 29.61357756])

-The concrete loss function can be set via the `loss` parameter. :class:`SGDClassifier` supports the
-following loss functions:
+The concrete loss function can be set via the `loss`
+parameter. :class:`SGDClassifier` supports the following loss functions:

     - `loss="hinge"`: (soft-margin) linear Support Vector Machine.
-    - `loss="modified_huber"`: smoothed hinge loss.
+    - `loss="modified_huber"`: smoothed hinge loss.
     - `loss="log"`: Logistic Regression

-The first two loss functions are lazy, they only update the model parameters if
-an example violates the margin constraint, which makes training very efficient.
-Log loss, on the other hand, provides probability estimates.
+The first two loss functions are lazy, they only update the model
+parameters if an example violates the margin constraint, which makes
+training very efficient. Log loss, on the other hand, provides
+probability estimates.

-In the case of binary classification and `loss="log"` you get a probability
-estimate P(y=C|x) using `predict_proba`, where `C` is the largest class label:
-
-    >>> clf = SGDClassifier(loss="log").fit(X, y)
-    >>> clf.predict_proba([[1., 1.]])
-    array([ 0.99999949])
+In the case of binary classification and `loss="log"` you get a
+probability estimate P(y=C|x) using `predict_proba`, where `C` is the
+largest class label:

-The concrete penalty can be set via the `penalty` parameter. `SGD` supports the
-following penalties:
+    >>> clf = SGDClassifier(loss="log").fit(X, y) >>>
+    clf.predict_proba([[1., 1.]]) array([ 0.99999949])
+
+The concrete penalty can be set via the `penalty` parameter. `SGD`
+supports the following penalties:

     - `penalty="l2"`: L2 norm penalty on `coef_`.
     - `penalty="l1"`: L1 norm penalty on `coef_`.
-    - `penalty="elasticnet"`: Convex combination of L2 and L1; `rho * L2 + (1 - rho) * L1`.
-
-The default setting is `penalty="l2"`. The L1 penalty leads to sparse solutions,
-driving most coefficients to zero. The Elastic Net solves some deficiencies of
-the L1 penalty in the presence of highly correlated attributes. The parameter `rho`
-has to be specified by the user.
-
-:class:`SGDClassifier` supports multi-class classification by combining multiple
-binary classifiers in a "one versus all" (OVA) scheme. For each of the `K` classes,
-a binary classifier is learned that discriminates between that and all other `K-1`
-classes. At testing time, we compute the confidence score (i.e. the signed distances
-to the hyperplane) for each classifier and choose the class with the highest
-confidence. The Figure below illustrates the OVA approach on the iris dataset.
-The dashed lines represent the three OVA classifiers;
-the background colors show the decision surface induced by the three classifiers.
+    - `penalty="elasticnet"`: Convex combination of L2 and L1; `rho * L2 + (1 - rho) * L1`.
+
+The default setting is `penalty="l2"`. The L1 penalty leads to sparse
+solutions, driving most coefficients to zero. The Elastic Net solves
+some deficiencies of the L1 penalty in the presence of highly correlated
+attributes. The parameter `rho` has to be specified by the user.
+
+:class:`SGDClassifier` supports multi-class classification by combining
+multiple binary classifiers in a "one versus all" (OVA) scheme. For each
+of the `K` classes, a binary classifier is learned that discriminates
+between that and all other `K-1` classes. At testing time, we compute the
+confidence score (i.e. the signed distances to the hyperplane) for each
+classifier and choose the class with the highest confidence. The Figure
+below illustrates the OVA approach on the iris dataset. The dashed
+lines represent the three OVA classifiers; the background colors show
+the decision surface induced by the three classifiers.

 .. figure:: ../auto_examples/linear_model/images/plot_sgd_iris.png
    :target: ../auto_examples/linear_model/plot_sgd_iris.html
    :align: center
    :scale: 75

-In the case of multi-class classification `coef_` is a two-dimensionaly array of shape
-[n_classes, n_features] and `intercept_` is a one dimensional array of shape [n_classes]. The i-th row of `coef_` holds the weight vector of the OVA classifier for the i-th
-class; classes are indexed in ascending order (see member `classes`).
+In the case of multi-class classification `coef_` is a two-dimensionaly
+array of shape [n_classes, n_features] and `intercept_` is a one
+dimensional array of shape [n_classes]. The i-th row of `coef_` holds
+the weight vector of the OVA classifier for the i-th class; classes are
+indexed in ascending order (see member `classes`).

 .. topic:: Examples:

@@ -140,17 +146,17 @@ class; classes are indexed in ascending order (see member `classes`).
 Regression
 ==========

-The class :class:`SGDRegressor` implements a plain stochastic gradient descent learning
-routine which supports different loss functions and penalties to fit linear regression
-models.
+The class :class:`SGDRegressor` implements a plain stochastic gradient
+descent learning routine which supports different loss functions and
+penalties to fit linear regression models.

 .. figure:: ../auto_examples/linear_model/images/plot_sgd_ols.png
    :target: ../auto_examples/linear_model/plot_sgd_ols.html
    :align: center
    :scale: 75

-The concrete loss function can be set via the `loss` parameter. :class:`SGDRegressor` supports the
-following loss functions:
+The concrete loss function can be set via the `loss`
+parameter. :class:`SGDRegressor` supports the following loss functions:

     - `loss="squared_loss"`: Ordinary least squares.
     - `loss="huber"`: Huber loss for robust regression.
@@ -165,7 +171,9 @@ following loss functions:
 Stochastic Gradient Descent for sparse data
 ===========================================

-.. note:: The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.
+.. note:: The sparse implementation produces slightly different results
+  than the dense implementation due to a shrunk learning rate for the
+  intercept.

 There is support for sparse data given in any matrix in a format
 supported by scipy.sparse. Classes have the same name, just prefixed
@@ -186,14 +194,14 @@ Implemented classes are :class:`SGDClassifier` and :class:`SGDRegressor`.
 Complexity
 ==========

-The major advantage of SGD is its efficiency, which is basically
-linear in the number of training examples. If X is a matrix of size (n, p)
-training has a cost of :math:`O(k n \bar p)`, where k is the number
-of iterations (epochs) and :math:`\bar p` is the average number of
-non-zero attributes per sample.
+The major advantage of SGD is its efficiency, which is basically
+linear in the number of training examples. If X is a matrix of size (n, p)
+training has a cost of :math:`O(k n \bar p)`, where k is the number
+of iterations (epochs) and :math:`\bar p` is the average number of
+non-zero attributes per sample.

-Recent theoretical results, however, show that the runtime to get some
-desired optimization accuracy does not increase as the training set size increases.
+Recent theoretical results, however, show that the runtime to get some
+desired optimization accuracy does not increase as the training set size increases.

 Tips on Practical Use
 =====================
@@ -206,51 +214,52 @@ Tips on Practical Use
     results. See `The CookBook
     <https://sourceforge.net/apps/trac/scikit-learn/wiki/CookBook>`_ for some
     examples on scaling. If your attributes have an intrinsic
-    scale (e.g. word frequencies or indicator features) scaling is
-    not needed.
+    scale (e.g. word frequencies or indicator features) scaling is
+    not needed.

-  * Finding a reasonable regularization term :math:`\alpha` is
+  * Finding a reasonable regularization term :math:`\alpha` is
     best done using grid search `for alpha in 10.0**-np.arange(1,7)`.

-  * Empirically, we found that SGD converges after observing
-    approx. 10^6 training samples. Thus, a reasonable first guess
-    for the number of iterations is `n_iter = np.ceil(10**6 / n)`,
+  * Empirically, we found that SGD converges after observing
+    approx. 10^6 training samples. Thus, a reasonable first guess
+    for the number of iterations is `n_iter = np.ceil(10**6 / n)`,
     where `n` is the size of the training set.

 .. topic:: References:

-  * `"Efficient BackProp" <yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf>`_
-    Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.
+  * `"Efficient BackProp" <yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf>`_
+    Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks
+    of the Trade 1998.

 .. _sgd_mathematical_formulation:

 Mathematical formulation
 ========================

-Given a set of training examples :math:`(x_1, y_1), \ldots, (x_n, y_n)` where
-:math:`x_i \in \mathbf{R}^n` and :math:`y_i \in \{-1,1\}`, our goal is to
+Given a set of training examples :math:`(x_1, y_1), \ldots, (x_n, y_n)` where
+:math:`x_i \in \mathbf{R}^n` and :math:`y_i \in \{-1,1\}`, our goal is to
 learn a linear scoring function :math:`f(x) = w^T x + b` with model parameters
 :math:`w \in \mathbf{R}^m` and intercept :math:`b \in \mathbf{R}`. In order to
 make predictions, we simply look at the sign of :math:`f(x)`.
-A common choice to find the model parameters is by minimizing the regularized
+A common choice to find the model parameters is by minimizing the regularized
 training error given by

 .. math::

     E(w,b) = \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)

-where :math:`L` is a loss function that measures model (mis)fit and :math:`R` is a
-regularization term (aka penalty) that penalizes model complexity; :math:`\alpha > 0`
-is a non-negative hyperparameter.
+where :math:`L` is a loss function that measures model (mis)fit and
+:math:`R` is a regularization term (aka penalty) that penalizes model
+complexity; :math:`\alpha > 0` is a non-negative hyperparameter.

-Different choices for :math:`L` entail different classifiers such as
+Different choices for :math:`L` entail different classifiers such as

    - Hinge: (soft-margin) Support Vector Machines.
    - Log: Logistic Regression.
-   - Least-Squares: Ridge Regression.
+   - Least-Squares: Ridge Regression.

-All of the above loss functions can be regarded as an upper bound on the
-misclassification error (Zero-one loss) as shown in the Figure below.
+All of the above loss functions can be regarded as an upper bound on the
+misclassification error (Zero-one loss) as shown in the Figure below.

 .. figure:: ../auto_examples/linear_model/images/plot_sgd_loss_functions.png
    :align: center
@@ -258,12 +267,13 @@ misclassification error (Zero-one loss) as shown in the Figure below.

 Popular choices for the regularization term :math:`R` include:

-   - L2 norm: :math:`R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2`,
-   - L1 norm: :math:`R(w) := \sum_{i=1}^{n} |w_i|`, which leadsin sparse solutions.
-   - Elastic Net: :math:`R(w) := \rho \frac{1}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i|`, a convex combination of L2 and L1.
+   - L2 norm: :math:`R(w) := \frac{1}{2} \sum_{i=1}^{n} w_i^2`,
+   - L1 norm: :math:`R(w) := \sum_{i=1}^{n} |w_i|`, which leads to sparse
+     solutions.
+   - Elastic Net: :math:`R(w) := \rho \frac{1}{2} \sum_{i=1}^{n} w_i^2 + (1-\rho) \sum_{i=1}^{n} |w_i|`, a convex combination of L2 and L1.

-The Figure below shows the contours of the different regularization terms
-in the parameter space when :math:`R(w) = 1`.
+The Figure below shows the contours of the different regularization terms
+in the parameter space when :math:`R(w) = 1`.

 .. figure:: ../auto_examples/linear_model/images/plot_sgd_penalties.png
    :align: center
@@ -272,25 +282,26 @@ in the parameter space when :math:`R(w) = 1`.
 SGD
 ---

-Stochastic gradient descent is an optimization method for unconstrained
+Stochastic gradient descent is an optimization method for unconstrained
 optimization problems. In contrast to (batch) gradient descent, SGD
-approximates the true gradient of :math:`E(w,b)` by considering a
-single training example at a time.
+approximates the true gradient of :math:`E(w,b)` by considering a
+single training example at a time.

-The class :class:`SGDClassifier` implements a first-order SGD learning routine.
-The algorithm iterates over the training examples and for each example
-updates the model parameters according to the update rule given by
+The class :class:`SGDClassifier` implements a first-order SGD learning
+routine. The algorithm iterates over the training examples and for each
+example updates the model parameters according to the update rule given by

 .. math::

-    w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}
+    w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}
     + \frac{\partial L(w^T x_i + b, y_i)}{\partial w})

-where :math:`\eta` is the learning rate which controls the step-size
-in the parameter space.
-The intercept :math:`b` is updated similarly but without regularization.
+where :math:`\eta` is the learning rate which controls the step-size in
+the parameter space. The intercept :math:`b` is updated similarly but
+without regularization.

-The model parameters can be accessed through the members coef\_ and intercept\_:
+The model parameters can be accessed through the members coef\_ and
+intercept\_:

    - Member coef\_ holds the weights :math:`w`

@@ -298,37 +309,39 @@ The model parameters can be accessed through the members coef\_ and intercept\_:

 .. topic:: References:

-  * `"Solving large scale linear prediction problems using stochastic gradient descent algorithms"
-    <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377>`_
+  * `"Solving large scale linear prediction problems using stochastic
+    gradient descent algorithms"
+    <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7377>`_
     T. Zhang - In Proceedings of ICML '04.
-
+
   * `"Regularization and variable selection via the elastic net"
-    <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696>`_
-    H. Zou, T. Hastie - Journal of the Royal Statistical Society Series B, 67 (2), 301-320.
+    <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696>`_
+    H. Zou, T. Hastie - Journal of the Royal Statistical Society Series B,
+    67 (2), 301-320.

 Implementation details
 ======================

-The implementation of SGD is influenced by the `Stochastic Gradient SVM
-<http://leon.bottou.org/projects/sgd>`_ of Léon Bottou. Similar to SvmSGD,
-the weight vector is represented as the product of a scalar and a vector
-which allows an efficient weight update in the case of L2 regularization.
-In the case of sparse feature vectors, the intercept is updated with a
-smaller learning rate (multiplied by 0.01) to account for the fact that
+The implementation of SGD is influenced by the `Stochastic Gradient SVM
+<http://leon.bottou.org/projects/sgd>`_ of Léon Bottou. Similar to SvmSGD,
+the weight vector is represented as the product of a scalar and a vector
+which allows an efficient weight update in the case of L2 regularization.
+In the case of sparse feature vectors, the intercept is updated with a
+smaller learning rate (multiplied by 0.01) to account for the fact that
 it is updated more frequently. Training examples are picked up sequentially
-and the learning rate is lowered after each observed example. We adopted the
-learning rate schedule from Shalev-Shwartz et al. 2007.
-For multi-class classification, a "one versus all" approach is used.
-We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009
-for L1 regularization (and the Elastic Net).
+and the learning rate is lowered after each observed example. We adopted the
+learning rate schedule from Shalev-Shwartz et al. 2007.
+For multi-class classification, a "one versus all" approach is used.
+We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009
+for L1 regularization (and the Elastic Net).
 The code is written in Cython.

 .. topic:: References:

   * `"Stochastic Gradient Descent" <http://leon.bottou.org/projects/sgd>`_
     L. Bottou - Website, 2010.

-  * `"Pegasos: Primal estimated sub-gradient solver for svm"
+  * `"Pegasos: Primal estimated sub-gradient solver for svm"
     <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.8513>`_
     S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07.

--
GitLab