Commit 47110f04 authored by Gael Varoquaux

DOC: Adapt the getting started tutorial to complete beginners.


git-svn-id: https://scikit-learn.svn.sourceforge.net/svnroot/scikit-learn/trunk@698 22fbfee3-77ab-4535-9bad-27d1bd3bc7d8

Getting started: an introduction to learning with the scikit
=============================================================
.. topic:: Section contents

    In this section, we introduce the machine learning vocabulary that we
    use throughout the `scikit.learn` and give a simple example of solving
    a learning problem.

Machine learning: the problem setting
---------------------------------------
We can separate learning problems in a few large categories:

* **supervised learning**, in which we are trying to learn the link
  between the data and a target variable that we want to predict for
  new data.

* **unsupervised learning**, in which we are trying to learn a
  synthetic representation of the data.
.. topic:: Training set and testing set

    Machine learning is about learning some properties of a data set
    and applying them to new data. This is why a common practice in
    machine learning to evaluate an algorithm is to split the data at
    hand into two sets, one that we call the *training set*, on which
    we learn data properties, and one that we call the *testing set*,
    on which we test these properties.
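
    For instance, with the data held in a NumPy array, such a split can
    be as simple as slicing. The sketch below is illustrative only; the
    array and the 90/10 split ratio are arbitrary choices, not part of
    the `scikit.learn`::

        >>> import numpy as np
        >>> X = np.arange(20).reshape((10, 2))   # 10 samples, 2 features
        >>> n_train = 9                          # keep the last sample for testing
        >>> X_train, X_test = X[:n_train], X[n_train:]
        >>> X_train.shape, X_test.shape
        ((9, 2), (1, 2))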
Loading an example dataset
--------------------------
The `scikit.learn` comes with a few standard datasets, for instance the
iris dataset, or the digits dataset::

    >>> from scikits.learn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some
metadata about the data. The data is stored in the `.data` member, which
is a `(n_samples, n_features)` array. In the case of a supervised
problem, the variables to predict are stored in the `.target` member.
For instance, in the case of the digits dataset, `digits.data` gives
access to the features that can be used to classify the digit samples::

    >>> digits.data
    array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
           [  0.,   0.,   0., ...,  10.,   0.,   0.],
           [  0.,   0.,   0., ...,  16.,   9.,   0.],
           ...,
           [  0.,   0.,   1., ...,   6.,   0.,   0.],
           [  0.,   0.,   2., ...,  12.,   0.,   0.],
           [  0.,   0.,  10., ...,  12.,   1.,   0.]])
and `digits.target` gives the ground truth for the digits dataset, that
is, the number corresponding to each digit image that we are trying to
learn::

    >>> digits.target
    array([0, 1, 2, ..., 8, 9, 8])
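
Both arrays share their first dimension, the number of samples. As a
quick sanity check (assuming the bundled digits data of 1797 images,
each described by 64 features)::

    >>> digits.data.shape
    (1797, 64)
    >>> digits.target.shape
    (1797,)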
.. topic:: Shape of the data arrays

    The data is always a 2D array, `(n_samples, n_features)`, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape `(8, 8)` and can
    be accessed using::

        >>> digits.images[0]
        array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
               [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
               [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
               [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
               [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
               [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
               [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
               [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
The :ref:`simple example on this dataset
<example_plot_digits_classification.py>` illustrates how, starting from
the original problem, one can shape the data for consumption in the
`scikit.learn`.
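
For instance, the `(8, 8)` digit images can be flattened by hand into
the `(n_samples, n_features)` layout expected by the estimators; a
minimal sketch with NumPy (the `reshape` call is plain NumPy, not a
dataset method)::

    >>> data = digits.images.reshape((digits.images.shape[0], -1))
    >>> data.shape
    (1797, 64)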
Learning and Predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes, on which we *fit* an `estimator` so as to be able to
*predict* the labels corresponding to new data.

In `scikit.learn`, an *estimator* is just a plain Python class that
implements the methods `fit(X, Y)` and `predict(T)`.
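
To make this contract concrete, here is a toy estimator following that
convention; the nearest-mean rule and the class name are ours, chosen
only to illustrate the `fit`/`predict` interface, not taken from the
`scikit.learn`::

    import numpy as np

    class NearestMeanClassifier:
        """Toy estimator: assign each sample to the class whose mean
        feature vector is closest."""

        def fit(self, X, Y):
            # Remember one mean feature vector per class seen in training.
            self.classes_ = np.unique(Y)
            self.means_ = np.array([X[Y == c].mean(axis=0)
                                    for c in self.classes_])
            return self

        def predict(self, T):
            # Squared distance of every sample to every class mean,
            # then pick the closest class.
            dist = ((T[:, np.newaxis, :] - self.means_) ** 2).sum(axis=2)
            return self.classes_[dist.argmin(axis=1)]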
An example of an estimator is the class ``scikits.learn.svm.SVC``, which
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The constructor
of an estimator takes as arguments the parameters of the model, but for
the time being, we will consider the estimator as a black box and not
worry about these::
    >>> from scikits.learn import svm
    >>> clf = svm.SVC()
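
As mentioned above, the model parameters go to the constructor. For
instance, `SVC` exposes a regularization parameter `C`; the name and
value below are arbitrary, chosen only to show the mechanism::

    >>> clf_strong = svm.SVC(C=100.)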
We call our estimator instance `clf`, as it is a classifier. It now must
be fitted to the model, that is, it must *learn* from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last one::
    >>> clf.fit(digits.data[:-1], digits.target[:-1]) #doctest: +ELLIPSIS
    <scikits.learn.svm.SVC object at 0x...>
Now you can predict new values. In particular, we can ask the classifier
what the digit of our last image in the `digits` dataset is, an image
that we have not used to train the classifier::
    >>> clf.predict(digits.data[-1])
    array([ 8.])
The corresponding image is the following:

.. image:: images/last_digit.png
    :align: center
As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?
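
The figure above can be reproduced from the raw image array. A sketch
assuming matplotlib is installed (it is not required by the
`scikit.learn` itself)::

    >>> import pylab as pl                                # doctest: +SKIP
    >>> pl.imshow(digits.images[-1], cmap=pl.cm.gray_r)   # doctest: +SKIP
    >>> pl.show()                                         # doctest: +SKIP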
A complete example of this classification problem is available as an
example that you can run and study:
:ref:`example_plot_digits_classification.py`.