From 47110f0462738359b0ffd078299dd36a7fb117ee Mon Sep 17 00:00:00 2001
From: Gael Varoquaux <gael.varoquaux@normalesup.org>
Date: Thu, 22 Apr 2010 19:10:17 +0000
Subject: [PATCH] DOC: Adapt the getting started tutorial to complete
 beginners.

git-svn-id: https://scikit-learn.svn.sourceforge.net/svnroot/scikit-learn/trunk@698 22fbfee3-77ab-4535-9bad-27d1bd3bc7d8
---
 doc/images/last_digit.png | Bin 0 -> 4789 bytes
 doc/tutorial.rst          | 161 +++++++++++++++++++++++++-------------
 2 files changed, 107 insertions(+), 54 deletions(-)
 create mode 100644 doc/images/last_digit.png

diff --git a/doc/images/last_digit.png b/doc/images/last_digit.png
new file mode 100644
index 0000000000000000000000000000000000000000..f6c715a54e216999839fa861dd55d590a5aa585b
GIT binary patch
literal 4789
zcmeAS@N?(olHy`uVBq!ia0y~yV5k6L4mJh`2Fnz)OAHJQEX7WqAsieW95oy%9SjT%
zoCO|{#S9G0!XV6Oo7Q}sfk7zT)5S5QV$R#xo0C@W5YqWj^y|<6?epe+Y<qv$rD{^i
zG_f3;cl%-mSrl$7XinYg`QTyI-neAdpRd>Ne{_Dn{m19$=l}Tgr|10n^S2GBzyA3A
zdAdyhaf|uq3+?7VF0s1X7k%z{Vb9+>`Tcc&yZ+YAmrGB*VP|Le#?JcQ+cHBQ=bpzB
zhYd>dvrp_e6<Yr!_X+!hd_8-H`r@CAKW3liKX5;$j^W?4O>h4?>aKcpetG7s-07!P
z^`@VW?PxY&KYskNVc+q`cgr`vEM0&7bxyD&$D~$uscSutE3{8L$?dym?Ja!ZR_5C0
z3?|H{8c#|8zxGx%L5z*DnqBL_nQh_6*PfrI$5yQVXh+DgHK7Z<&!jdK@;0oB);=Lw
zEcxPWS}@B$?WP#k2uJO=TYDIT*o-zXi2pwP(12Ow`h#ps(YH&M^;I7zIk3<AMz_S8
zgk7IIjGB!nY+NaoX1(sgHHWq-3`g0-rXP5gy?1xX?S$tSQc_>VTxCd`aC7FP?46s=
z9Moc*Y#0%`ebc=4A4QkUJjk`^_Lu6uy{i^&4FApDP^eq}=D<<A;Ki!aZ{8Gmy4JHU
z5O4Kd8@>AP*^rX+d#jw6gz4{M*>>--?9$l>tKLaG3&|4GvRuI5?i`~x-L+5FJNr*G
z_e7DHz0a(@r}OF=UR@a|er=0I1job}S+B3n7GKXC=Ze$55YHMi%{4KIGp4k`NP5k;
zaL2R;?^g!D&(y9zx-m*hZ|`aw5jD#RbFbe%mEap}mA>rRi!B!6svG83ty`*nKT?iw
z@|9<4Z*n&tWcXfSam2)T<=Hir+q7&Ce7p7V&Aj(Y{=Acop09f1n!fT>=B#%%H!j&d
zn&0>A;0#0i<dvUid3W}$SGb!fbfj@+>fET{seB3BW7l7oTp?JfeO{vHuMpezl+E`<
zKOAM)ersFY@+Ic)u4bosPFMZ7J8eTuOM%tgr0utlzAW*&JZI}BQ+9Xr1Evm<RXKNr
z{7t<}cI>!g**x>w!{0?e%XasE{wZ_(vEjDcZ*y<wm~Bo@IZ|=iXio9`b&9?W#p%ta
zd=V=`J@hB29zSUqIy3asr|fxl`6m|nUKW{s_E}n(h{^WU1G_i%NSvRd$F|!~{&*pq
zj{J6m9VQ$F^85C4HIy-=-`py+<HUkYlYmv`#znjD9?RQ4J2&ds+s5}BriC!3U3(Sh
z$gp8W!P(&4jW<=lPQ9wle6{iyU(S@?#}&ahmrme(KKrcNx8JpI%68ut7Ph#q>HCK-
zt4jA}n!#$fH2rce<~y?;3O=g6d)wmJ%aL7eEU<ZlAN#u1?YEf|Y%Vn!e7L?oetY$!
z-5K`N*DIbio1wjYz9T~(+Xm)sSKT(R_`XJVi%MMhhp)SIk3Y6JcKrBpUG485KNfO}
z$`)_G{kEE+r^_-iz_e!N(P?|NW-h&=scTquviV$2#o8R1;}&xLAAf&;f3r)3ZMv+{
zw(r>|7a6R4YMoIqO*ZSKgf6RH@MYaMPg#85oAo``?Ogm|qw}`AdFk1wuT4o4<qiCL
z;(li82jvB6^R~>J?x4%~cA15#^7L9e9_Ldr&r?@>MmX=9yQahP1kai_ZPkFe;kSge
zHWsk2iaoRE#VmGN!E#~crNwq_mqVDv&rRoRF--}aaplwAjv|>mQc_nP`c6o`X;{1R
zMaPEhV?QjkrP~&A>Fvo`e(d^$_Crh7)OgO=(z5r}-Ob-;Te@*=N;cK^y0A0r-s_Fm
zjyHC{-#?f4+#!{Jd@hV--*%|_oLYDO`$fmrWVZ8-vkfaH+HR{b=^uGMbKRRc?5W4@
z^!N8mKfZs8Exk{)MZ5Epj>)RjEm5{<d-bQM>a43gBv>qZR^i{@@AnHWWG?lx>wY@R
zqsPW{`~7yt9cnij3VufU?Kr&n36FrArHe4XZZ+HF=Zg#LW#;=WznRe8G4s~bv$pa}
z-<0>s-oCjtcipMW3}uIuZ#AB3*mNsdG;H(I8o9#xPYNC|1&F<?Sj%qgo2zq!tD&-q
zakG2XeI41|ZF_hU(kEXrD?C2WZ}~cv=3@`%Y)F`3oZjcWw`b?fb{pT|ob9(~X5}$X
z|DCBDDf*zJlFgmDHqF48?b~@(i7zriLFXB+8M^Q9Phq{j^vHv=X~xA`QxDoYtXI@e
zH87tQ?{Lh=vHg?qzg^e+E?(%B)jIQ3tNNogyTN^b_UYTYm;W}K?Q6axUg|-|Pmgdm
z>xPRHBjlLP4jLBmxUg|P7yP``>_tfFoshZ70vkR*kqduX_OL>tZFlG2|5A54#0?_e
z>fcT{`)Y&MoxJVl4{qKLEnsA7VZ3&->R(?|w`sV9j_PtfrFREHdB5uE)U6bMn`5v#
z)$eYu+3eeo%<7gHKdLKT%se5o+-`pQe9MBF2e%sV<T9VQQGC0E_uBTbi+N`+)$QM3
zoD{+OXm|Tz#Z-mW@vQPbCHxx=ckEixH}hm}K`PgT;tewn{ukGh+N{a2^V!0*H!K#j
z&#Hax3ca%X+7kBr{+}N&-D7#qTh3(r^XRIJs}`9GA1Pe%Y=t6^fd9q9^i&4z)dyqV
z6>Kk%P5oiJKK0srg>CF7*39#MP!}mJQ#k))V#Cy(f_1v3N9OE&{`uyuG!Hht=SOaq
zeNNwXu5#8g=>Q3bJ9^KfQ~0K3b>ts>pmQ<M>+jS9PgmN!HK_>?o__eld7HJbV{|Vc
zeHCNR+Z$7I;CIj7`);o8KPJ9gc5=!~E{W#TvNJ3r1r4itSUGt@d#n0VTDvQ=C;!}a
zVK2+4?S-2}ET$V2G%F<Wt@C@hva9T-=DPOPv#r~s&K-F2nI~b}L0zq1r*9|RO;3$H
zyj3`6$0NJu9WQt0oZ>UIV1B#opM~wLmA`|c;||{I5Z~ixF?)9DdCuLU!Z&Nyc+1;4
zZ|W;&*zL+v^RDi($(dQ3w3>|zvJWJ+wluAY)PLS;f2vRH>Pt0EublFy>nB|e{US7>
z^0D~h)S3Fd3+FQytkjikxcRK1R`iB`YJvWNzJoSXr&m7iXL|E(PJh{%5@Ub981`4Q
z6!uR!xc0%PN88TuWwUSE!`yt;?|*pN4^;;HQwgzeCoZ*Aj!SRO&f57TV{cI0qDAr>
z5Bs@ZKTzA&rMK1jw9E=!onxifrfd(p*tb{m#DsJsi)?S*5Up149enSvPuX6!$!LP?
zj`Kkmcd)&Won5u}dAPCoCo!J9!1FimEc$FZM`y3w$-pS4Pum%GOg-r5x|@AZ+|(Bt
zx#65K;$B-@e9lkX?ptE8LT~ccvPFwSZIh?1RFCb<-s=?i%=P+#Qcdr&H6_uvPaZPd
z?s~hW%yPoj*=7F@X&;E2dr<J}{vD?!-iYm+erQR@^Yl#-UJv*VPiXq}G^Zf@smz97
zx0q``SARR7o>i{=>~hKV*I&)zUx&GE<5K(UUA*HCKZ7_o*Bu2N-6hY9!>0$?Mlbno
zFTG{Wmz{Qx7cZE0ZY~$w42x%-6<e#9_GwIgsl*z){izsJlI?ekzVf<eH?BN|WX9%e
z&94`yez4s1&0*KS%?rOr<mU39c>ejP-`tN&tRkX=YnQe%-7LJe^PL3GwQsu&Hn14!
z|IfXf^}wyfV29Y2-MO5_H#>tgE4RPfZCh=-E>qj?*}ek9eaw~HZ=e0c$8d&u#`maa
zuim75)c!k{d10#e6=UW%ZrT1bnQngCne9C--fjPe$O|*|**|@(*t2&+Yv>kZ_8hSU
zHl?JnZyUPKIq0-9Up6tlo$y`KCM2)mwBhX&ml95#+Zp@j$xFF7<NENe(~>tboz;p8
zkNae4zl<xhQ$p;oVVHN5EW`F23Ew66B(e5LuY3RZK;i1QMbZvMic&VIrv85(#Aq+@
zbIrCs5YcdYpGWqJuM@h;^7b?DXy185A%lmxM|H=ajE;Rl`>!V^Hbm~7cmDb8@}9Z_
zlLZA0vtJwVWX`MIfB$?#TEktV=os;Prr+CVO>So_JCOEKcKX#T#hq6(&Y1X?zn*ya
zGE2>zjXR#6Y|EULcTQlQcbe%5EvZ^2Hg}g-uErO5K0GYgQ9aF5*h1>t!h8OS2KSGw
zO!Rdti0s|E^!&`VrxJRK;xf+ZmtI`XE^_3Ju|e_n+jniZb(MWqUKZ2m7}?b<n!+~g
z;8A5`zK9=cx)014d)%hHS)W$0X-+YBaiI;*;j?^}4PRsBE20n7?O)#}6ZwKs`m~&G
z%k7hLI@1r=S?fn#HEB4S&=hqpsd8QZv}Lh88<=;p=vk}pJ##d@V$bE&O?PH9RUF>z
zJ9B%i*5SoV&%~@qymHo1dS3m>Zngz^$=9r&=`n7XOnB~U{dG-Tx_{DP%gD%8LH9)t
zyubD9$B!M=8f&MOJxRV1&engHxA}aji~JO)#X>h$t(Dm!W3=<&295_he0hvJw@uwA
zE@-}WId7Q)^Mm{Q>(4XjHNLj9pMQSywncsCyu-_~tFs;$oj4c$%xW`Ny;ZYlhPn5X
zoP7Q4t0$A#t{+HkDBm3`c;eErhXSipEu3cgTx^_hGa*%XQ~0R_j_aMd_BAb`bJy_P
ziMzHdIj`rIbZGdm0L%aW6VsaROxwHZVTskv>xMhp&u*1BIvet79pkobz5NfRHJaxy
zS|pkxw^i~)MnLAKd!bXb=iK}^>sdnk5|%B$_bffP)FSlt+h6OZ@Rnpnn3tbF%8<=|
z{ZWE>RkdT;_p)Z~-ScnV3R}Qf5O#gdy~z1rn@*=n23I@2``2A={wOM>q?hwn<htNZ
zJW<CUh#h-1?dojjZ9Sn|558>B$zA$;O<Hs8iX}X{ji(=2^j}!HJMOo<;N4X_k36__
z;7L_F^8$C<^yWoZ-<2`OY;N|>Dd2tb=3ZXHPDa*-={FK$O9EDJzGnRTFwYaI+1|`g
zH5qnuU7Is!)~wYV58mBvw!z{eL&N;s=IGaR8l$qk`@4?WhB4l_d1Th)tyjABc-Niz
zXP?z<S`+<9;#DDk@!j&R_EC;?5w6AV+mG!1Gkemi)hpA}|Ig2Cjz0J{apSFfVqfR{
z)X)#SsQc|hAe;N0_3PJV@-pl>!uGw*bE{qP-|(ajO|RI3w0E3kaNZ(xj%|}}Oj7&i
zzEv~X*gptK&sq6AYQ~c-{%il4KAX11)w8&SIVRY2?YHZO3!k~9d=E8!8?q(u`s=IH
zJ}=YSZ|bw^Yoz{#t@pQ|S`?CclI^=iq^{Yr)`QPK|C}~ikMYi<l$Ejho%g3V-C_Es
z{J@Oy&4Jv|sr@ONM4p6MuURcv-MH@PThqyh{!aZG&tH9aZ&gEG#;yWSnLDfBMVYWS
zpI%kbIr*xz(pQ5Ujn|XvH``UM;w_syH*^b|+BdCTX_l*<d7pF~pR054^IoH{uN||N
z<Qzy@Q@e_{Z%6$jzxY|PRj&KZ4u#$cofYnzly~i~D=$Aw$-Z^VADvZo+`%KrusrH&
zv)lZP`ku!b@w#WU?MfRJ#IzDFGM&%dz5RIc+^H<x4CmBtmga7H-Qu%$%L>Wsk&(_n
zcU?BAzid6-c*f0tMLTblH)-rx=YOnl%YkVJo?g{ZZ75B6{UDB~;M#(>Wzy46KmAr#
zlCJL6vzF0@=Wwp3x7nn~rH`X{4Bi+<oK?1u=CJp@EW0BsX@`HshWPc@Z!QQ=5Pg!P
z-MEHlUE?2vX5+0req}xD7<;@vpE&#Ps7kWIjnxkdmuu|c`S7yj|0%U~tEE2(7A~J>
z%x3*9?Lpd`*trKKH>@>Z-)Yyglz;BZSxLWFf6cH>YnXaaRl29>Z*iZGt@VQJZ{?gP
zx+Q9M#=Hxe%)CR)ZO89|*6c|2fBUMwbD6yNPQ6_6|H2%j8Lv;Bj!X=bDK6#Rp1jd2
zMOSN=TDPQ)@8|Rk374Apf{FdtZ~9#K__}B-ukPC`Yhrp2UVpt`zh1KK;W@VNHy*v|
zY@cqtW689ciG4fQ_N0lWy_#~`Yu>K7?Iq`5f3;#3JHBo&kHNX=4BXOd>Q4s$%AR*s
zGKN`VvDND<Yl_~y*|3GL>srO`DLxuY1$OoYTs&-W=ghR^{Ie&SPW)OvZP&`mtcgb%
zuG<|i_mZ3;lrjBu>bKuPmzF%Wu`YX_bNXw9uVC}KosD8<#wYZm3jXh(GE+l6R%_N<
z-uiQq3Kog)B4@w(^{y_UThI1HDucSxyM~Benh$E$eN72IbaT60g5-l++je!WTba5o
zcIoTX;}+$WIa4-rJ^!5Ja77|=_6Lovjjz{hZDm=*9C_G5Tx#8`luXG>^_NZV&e3O6
zdp%{%;vYB89(VHo@E}n7m73`t9p6lcD@o7ZY}Pp$xcP9!o2_DEX^KgnyCl~<zg79C
zud`<**Ba(1@wcgo6T6!?zuZ3Ud;t6O<z;^l9$%{%bS&XoTcqq4YyZT8Bm4PS-|(HA
z+jaU)C}+T33wfKq$GMg|@A|DaEGtk*&pxo}{KYUcYliU8_h01HEH72l+<fp(-E{Tx
uXy35^?+?lU%zuh?wB{4@2lMIw85i4c>wPZLcb$QOfx*+&&t;ucLK6VllVPs_

literal 0
HcmV?d00001

diff --git a/doc/tutorial.rst b/doc/tutorial.rst
index 3a37eecb4d..fb8c89dbe8 100644
--- a/doc/tutorial.rst
+++ b/doc/tutorial.rst
@@ -1,6 +1,13 @@
 Getting started: an introduction to learning with the scikit
 =============================================================
 
+.. topic:: Section contents
+
+    In this section, we introduce the machine learning vocabulary that we
+    use throughout the `scikit.learn` and give a simple example of
+    solving a learning problem.
+
+
 Machine learning: the problem setting
 ---------------------------------------
 
@@ -27,7 +34,17 @@ We can separate learning problems in a few large categories:
  * **unsupervised learning**, in which we are trying to learning a
    synthetic representation of the data.
 
-Loading a sample dataset
+.. topic:: Training set and testing set
+
+    Machine learning is about learning some properties of a data set and
+    applying them to new data. This is why a common practice in machine
+    learning, to evaluate an algorithm, is to split the data at hand into two
+    sets, one that we call a *training set* on which we learn data
+    properties, and one that we call a *testing set*, on which we test
+    these properties.
+
+
+Loading an example dataset
 --------------------------
 
 The `scikit.learn` comes with a few standard datasets, for instance the
@@ -39,61 +56,97 @@ the `digits dataset
     >>> iris = datasets.load_iris()
     >>> digits = datasets.load_digits()
 
-A dataset is a dictionary-like object that holds all the samples and
-some metadata about the samples. You can access the underlying data
-with members `.data` and `.target`.
-
-For instance, in the case of the iris dataset, `iris.data` gives access
-to the features that can be used to classify the iris samples::
-
-    >>> iris.data
-    array([[ 5.1,  3.5,  1.4,  0.2],
-	   [ 4.9,  3. ,  1.4,  0.2],
-	   [ 4.7,  3.2,  1.3,  0.2],
-	   ...
-	   [ 6.5,  3. ,  5.2,  2. ],
-	   [ 6.2,  3.4,  5.4,  2.3],
-	   [ 5.9,  3. ,  5.1,  1.8]])
-
-and `iris.target` gives the ground thruth for the iris dataset, that is
-the labels describing the different classes of irises that we are trying
-to learn:
+A dataset is a dictionary-like object that holds all the data and some
+metadata about the data. This data is stored in the `.data` member, which
+is an `n_samples, n_features` array. In the case of a supervised problem,
+the response variables (the values to predict) are stored in the `.target` member.
 
->>> iris.target
-array([ 0.,  0.,  0.,  0., ... 2.,  2.,  2.,  2.])
+For instance, in the case of the digits dataset, `digits.data` gives
+access to the features that can be used to classify the digits samples::
 
+    >>> digits.data
+    array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
+           [  0.,   0.,   0., ...,  10.,   0.,   0.],
+           [  0.,   0.,   0., ...,  16.,   9.,   0.],
+           ..., 
+           [  0.,   0.,   1., ...,   6.,   0.,   0.],
+           [  0.,   0.,   2., ...,  12.,   0.,   0.],
+           [  0.,   0.,  10., ...,  12.,   1.,   0.]])
 
-Prediction
-----------
+and `digits.target` gives the ground truth for the digit dataset, that
+is the number corresponding to each digit image that we are trying to
+learn:
 
-Suppose some given data points each belong to one of two classes, and
-the goal is to decide which class a new data point will be in. In
-``scikits.learn`` this is done with an *estimator*. An *estimator* is
-just a plain Python class that implements the methods fit(X, Y) and
-predict(T).
+>>> digits.target
+array([0, 1, 2, ..., 8, 9, 8])
 
-An example of predictor is the class
-``scikits.learn.neighbors.Neighbors``(XXX ref). The constructor of a predictor
-takes as arguments the parameters of the model. In this case, our only
-parameter is k, the number of neighbors to consider.
-
->>> from scikits.learn import neighbors
->>> clf = neighbors.Neighbors(k=3)
-
-The predictor now must be fitted to the model, that is, it must
-`learn` from the model. This is done by passing our training set to
-the ``fit`` method.
-
->>> clf.fit(iris.data, iris.target) #doctest: +ELLIPSIS
-<scikits.learn.neighbors.Neighbors instance at 0x...>
-
-Now you can predict new values
-
->>> print clf.predict([[0, 0, 0, 0]])
-[[ 0.]]
-
-
-Regression
-----------
-In the regression problem, classes take continous values.
-Linear Regression. TODO
+.. topic:: Shape of the data arrays
+   
+    The data is always a 2D array, `n_samples, n_features`, although
+    the original data may have had a different shape. In the case of the
+    digits, each original sample is an image of shape `8, 8` and can be
+    accessed using:
+
+    >>> digits.images[0]
+    array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
+           [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
+           [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
+           [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
+           [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
+           [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
+           [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
+           [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
+
+    The :ref:`simple example on this dataset <example_plot_digits_classification.py>`
+    illustrates how, starting from the original problem, one can shape the
+    data for consumption in the `scikit.learn`.
+
+
+Learning and Predicting
+------------------------
+
+In the case of the digits dataset, the task is to predict the value of a
+hand-written digit from an image. We are given samples of each of the 10
+possible classes on which we *fit* an `estimator` to be able to *predict*
+the labels corresponding to new data.
+
+In `scikit.learn`, an *estimator* is just a plain Python class that
+implements the methods `fit(X, Y)` and `predict(T)`.
+
+An example of an estimator is the class ``scikits.learn.svm.SVC`` that
+implements `Support Vector Classification
+<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
+constructor of an estimator takes as arguments the parameters of the
+model, but for the time being, we will consider the estimator as a black
+box and not worry about these:
+
+>>> from scikits.learn import svm
+>>> clf = svm.SVC()
+
+We call our estimator instance `clf` as it is a classifier. It now must
+be fitted to the data, that is, it must `learn` from the data. This is
+done by passing our training set to the ``fit`` method. As a training
+set, let us use all the images of our dataset apart from the last
+one:
+
+>>> clf.fit(digits.data[:-1], digits.target[:-1]) #doctest: +ELLIPSIS
+<scikits.learn.svm.SVC object at 0x...>
+
+Now you can predict new values. In particular, we can ask the
+classifier which digit is shown in the last image of the `digits` dataset,
+which we have not used to train the classifier:
+
+>>> clf.predict(digits.data[-1])
+array([ 8.])
+
+The corresponding image is the following:
+
+.. image:: images/last_digit.png
+    :align: center
+
+As you can see, it is a challenging task: the images are of poor
+resolution. Do you agree with the classifier?
+
+A complete example of this classification problem is available for
+you to run and study:
+:ref:`example_plot_digits_classification.py`.
-- 
GitLab