Commit 47110f04 authored by Gael Varoquaux

DOC: Adapt the getting started tutorial to complete beginners.


git-svn-id: https://scikit-learn.svn.sourceforge.net/svnroot/scikit-learn/trunk@698 22fbfee3-77ab-4535-9bad-27d1bd3bc7d8

Getting started: an introduction to learning with the scikit
=============================================================
.. topic:: Section contents

    In this section, we introduce the machine learning vocabulary that we
    use throughout the `scikit.learn` and give a simple example of solving
    a learning problem.

Machine learning: the problem setting
---------------------------------------
We can separate learning problems in a few large categories:

* **supervised learning**, in which we are trying to learn the link
  between the data and a target variable that we want to predict for
  new data.

* **unsupervised learning**, in which we are trying to learn a
  synthetic representation of the data.
.. topic:: Training set and testing set

    Machine learning is about learning some properties of a data set
    and applying them to new data. This is why a common practice in
    machine learning to evaluate an algorithm is to split the data at
    hand into two sets, one that we call the *training set*, on which
    we learn data properties, and one that we call the *testing set*,
    on which we test these properties.
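
    For instance, with the data held in a NumPy array, such a split can
    be as simple as slicing. The sketch below is illustrative only; the
    array and the 90/10 split ratio are arbitrary choices, not part of
    the `scikit.learn`::

        >>> import numpy as np
        >>> X = np.arange(20).reshape((10, 2))   # 10 samples, 2 features
        >>> n_train = 9                          # keep the last sample for testing
        >>> X_train, X_test = X[:n_train], X[n_train:]
        >>> X_train.shape, X_test.shape
        ((9, 2), (1, 2))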
Loading an example dataset
--------------------------
The `scikit.learn` comes with a few standard datasets, for instance the
iris dataset, or the digits dataset::

    >>> from scikits.learn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some
metadata about the data. The data is stored in the `.data` member, which
is a `(n_samples, n_features)` array. In the case of a supervised
problem, the variables to predict are stored in the `.target` member.
For instance, in the case of the digits dataset, `digits.data` gives
access to the features that can be used to classify the digit samples::

    >>> digits.data
    array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
           [  0.,   0.,   0., ...,  10.,   0.,   0.],
           [  0.,   0.,   0., ...,  16.,   9.,   0.],
           ...,
           [  0.,   0.,   1., ...,   6.,   0.,   0.],
           [  0.,   0.,   2., ...,  12.,   0.,   0.],
           [  0.,   0.,  10., ...,  12.,   1.,   0.]])
and `digits.target` gives the ground truth for the digits dataset, that
is, the number corresponding to each digit image that we are trying to
learn::

    >>> digits.target
    array([0, 1, 2, ..., 8, 9, 8])
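
Both arrays share their first dimension, the number of samples. As a
quick sanity check (assuming the bundled digits data of 1797 images,
each described by 64 features)::

    >>> digits.data.shape
    (1797, 64)
    >>> digits.target.shape
    (1797,)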
.. topic:: Shape of the data arrays

    The data is always a 2D array, `(n_samples, n_features)`, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape `(8, 8)` and can
    be accessed using::

        >>> digits.images[0]
        array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
               [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
               [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
               [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
               [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
               [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
               [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
               [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
The :ref:`simple example on this dataset
<example_plot_digits_classification.py>` illustrates how, starting from
the original problem, one can shape the data for consumption in the
`scikit.learn`.
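
For instance, the `(8, 8)` digit images can be flattened by hand into
the `(n_samples, n_features)` layout expected by the estimators; a
minimal sketch with NumPy (the `reshape` call is plain NumPy, not a
dataset method)::

    >>> data = digits.images.reshape((digits.images.shape[0], -1))
    >>> data.shape
    (1797, 64)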
Learning and Predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes, on which we *fit* an `estimator` so as to be able to
*predict* the labels corresponding to new data.

In `scikit.learn`, an *estimator* is just a plain Python class that
implements the methods `fit(X, Y)` and `predict(T)`.
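
To make this contract concrete, here is a toy estimator following that
convention; the nearest-mean rule and the class name are ours, chosen
only to illustrate the `fit`/`predict` interface, not taken from the
`scikit.learn`::

    import numpy as np

    class NearestMeanClassifier:
        """Toy estimator: assign each sample to the class whose mean
        feature vector is closest."""

        def fit(self, X, Y):
            # Remember one mean feature vector per class seen in training.
            self.classes_ = np.unique(Y)
            self.means_ = np.array([X[Y == c].mean(axis=0)
                                    for c in self.classes_])
            return self

        def predict(self, T):
            # Squared distance of every sample to every class mean,
            # then pick the closest class.
            dist = ((T[:, np.newaxis, :] - self.means_) ** 2).sum(axis=2)
            return self.classes_[dist.argmin(axis=1)]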
An example of an estimator is the class ``scikits.learn.svm.SVC``, which
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The constructor
of an estimator takes as arguments the parameters of the model, but for
the time being, we will consider the estimator as a black box and not
worry about these::
    >>> from scikits.learn import svm
    >>> clf = svm.SVC()
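
As mentioned above, the model parameters go to the constructor. For
instance, `SVC` exposes a regularization parameter `C`; the name and
value below are arbitrary, chosen only to show the mechanism::

    >>> clf_strong = svm.SVC(C=100.)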
We call our estimator instance `clf`, as it is a classifier. It now must
be fitted to the model, that is, it must *learn* from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last one::
    >>> clf.fit(digits.data[:-1], digits.target[:-1]) #doctest: +ELLIPSIS
    <scikits.learn.svm.SVC object at 0x...>
Now you can predict new values. In particular, we can ask the classifier
what the digit of our last image in the `digits` dataset is, an image
that we have not used to train the classifier::
    >>> clf.predict(digits.data[-1])
    array([ 8.])
The corresponding image is the following:

.. image:: images/last_digit.png
    :align: center
As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?
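
The figure above can be reproduced from the raw image array. A sketch
assuming matplotlib is installed (it is not required by the
`scikit.learn` itself)::

    >>> import pylab as pl                                # doctest: +SKIP
    >>> pl.imshow(digits.images[-1], cmap=pl.cm.gray_r)   # doctest: +SKIP
    >>> pl.show()                                         # doctest: +SKIP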
A complete example of this classification problem is available as an
example that you can run and study:
:ref:`example_plot_digits_classification.py`.