From 47110f0462738359b0ffd078299dd36a7fb117ee Mon Sep 17 00:00:00 2001
From: Gael Varoquaux <gael.varoquaux@normalesup.org>
Date: Thu, 22 Apr 2010 19:10:17 +0000
Subject: [PATCH] DOC: Adapt the getting started tutorial to complete beginners.

git-svn-id: https://scikit-learn.svn.sourceforge.net/svnroot/scikit-learn/trunk@698 22fbfee3-77ab-4535-9bad-27d1bd3bc7d8
---
 doc/images/last_digit.png | Bin 0 -> 4789 bytes
 doc/tutorial.rst          | 161 +++++++++++++++++++++++++--------------
 2 files changed, 107 insertions(+), 54 deletions(-)
 create mode 100644 doc/images/last_digit.png

diff --git a/doc/images/last_digit.png b/doc/images/last_digit.png
new file mode 100644
index 0000000000000000000000000000000000000000..f6c715a54e216999839fa861dd55d590a5aa585b
Binary files /dev/null and b/doc/images/last_digit.png differ
diff --git a/doc/tutorial.rst b/doc/tutorial.rst
index 3a37eecb4d..fb8c89dbe8 100644
--- a/doc/tutorial.rst
+++ b/doc/tutorial.rst
@@ -1,6 +1,13 @@
 Getting started: an introduction to learning with the scikit
 =============================================================
 
+.. topic:: Section contents
+
+    In this section, we introduce the machine learning vocabulary that we
+    use throughout the `scikit.learn` and give a simple example of
+    solving a learning problem.
+
+
 Machine learning: the problem setting
 ---------------------------------------
 
@@ -27,7 +34,17 @@ We can separate learning problems in a few large categories:
  * **unsupervised learning**, in which we are trying to learning a synthetic
    representation of the data.
 
-Loading a sample dataset
+.. topic:: Training set and testing set
+
+    Machine learning is about learning some properties of a data set and
+    applying them to new data. This is why a common practice in machine
+    learning to evaluate an algorithm is to split the data at hand into two
+    sets, one that we call a *training set* on which we learn data
+    properties, and one that we call a *testing set*, on which we test
+    these properties.
+
+
+Loading an example dataset
 --------------------------
 
 The `scikit.learn` comes with a few standard datasets, for instance the
@@ -39,61 +56,97 @@ the `digits dataset
 >>> iris = datasets.load_iris()
 >>> digits = datasets.load_digits()
 
-A dataset is a dictionary-like object that holds all the samples and
-some metadata about the samples. You can access the underlying data
-with members `.data` and `.target`.
-
-For instance, in the case of the iris dataset, `iris.data` gives access
-to the features that can be used to classify the iris samples::
-
-    >>> iris.data
-    array([[ 5.1,  3.5,  1.4,  0.2],
-           [ 4.9,  3. ,  1.4,  0.2],
-           [ 4.7,  3.2,  1.3,  0.2],
-           ...
-           [ 6.5,  3. ,  5.2,  2. ],
-           [ 6.2,  3.4,  5.4,  2.3],
-           [ 5.9,  3. ,  5.1,  1.8]])
-
-and `iris.target` gives the ground thruth for the iris dataset, that is
-the labels describing the different classes of irises that we are trying
-to learn:
+A dataset is a dictionary-like object that holds all the data and some
+metadata about the data. This data is stored in the `.data` member, which
+is an `n_samples, n_features` array. In the case of a supervised problem,
+the response variables are stored in the `.target` member.
 
->>> iris.target
-array([ 0., 0., 0., 0., ... 2., 2., 2., 2.])
+For instance, in the case of the digits dataset, `digits.data` gives
+access to the features that can be used to classify the digits samples::
 
+    >>> digits.data
+    array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
+           [  0.,   0.,   0., ...,  10.,   0.,   0.],
+           [  0.,   0.,   0., ...,  16.,   9.,   0.],
+           ...,
+           [  0.,   0.,   1., ...,   6.,   0.,   0.],
+           [  0.,   0.,   2., ...,  12.,   0.,   0.],
+           [  0.,   0.,  10., ...,  12.,   1.,   0.]])
 
-Prediction
-----------
+and `digits.target` gives the ground truth for the digit dataset, that
+is the number corresponding to each digit image that we are trying to
+learn:
 
-Suppose some given data points each belong to one of two classes, and
-the goal is to decide which class a new data point will be in. In
-``scikits.learn`` this is done with an *estimator*. An *estimator* is
-just a plain Python class that implements the methods fit(X, Y) and
-predict(T).
+>>> digits.target
+array([0, 1, 2, ..., 8, 9, 8])
 
-An example of predictor is the class
-``scikits.learn.neighbors.Neighbors``(XXX ref). The constructor of a predictor
-takes as arguments the parameters of the model. In this case, our only
-parameter is k, the number of neighbors to consider.
-
->>> from scikits.learn import neighbors
->>> clf = neighbors.Neighbors(k=3)
-
-The predictor now must be fitted to the model, that is, it must
-`learn` from the model. This is done by passing our training set to
-the ``fit`` method.
-
->>> clf.fit(iris.data, iris.target) #doctest: +ELLIPSIS
-<scikits.learn.neighbors.Neighbors instance at 0x...>
-
-Now you can predict new values
-
->>> print clf.predict([[0, 0, 0, 0]])
-[[ 0.]]
-
-
-Regression
-----------
-In the regression problem, classes take continous values.
-Linear Regression. TODO
+.. topic:: Shape of the data arrays
+
+    The data is always a 2D array, `n_samples, n_features`, although
+    the original data may have had a different shape. In the case of the
+    digits, each original sample is an image of shape `8, 8` and can be
+    accessed using:
+
+    >>> digits.images[0]
+    array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
+           [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
+           [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
+           [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
+           [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
+           [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
+           [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
+           [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
+
+    The :ref:`simple example on this dataset <example_plot_digits_classification.py>`
+    illustrates how, starting from the original problem, one can shape the
+    data for consumption in the `scikit.learn`.
+
+
+Learning and Predicting
+------------------------
+
+In the case of the digits dataset, the task is to predict the value of a
+hand-written digit from an image. We are given samples of each of the 10
+possible classes on which we *fit* an `estimator` to be able to *predict*
+the labels corresponding to new data.
+
+In `scikit.learn`, an *estimator* is just a plain Python class that
+implements the methods `fit(X, Y)` and `predict(T)`.
+
+An example of an estimator is the class ``scikits.learn.svm.SVC`` that
+implements `Support Vector Classification
+<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
+constructor of an estimator takes as arguments the parameters of the
+model, but for the time being, we will consider the estimator as a black
+box and not worry about these:
+
+>>> from scikits.learn import svm
+>>> clf = svm.SVC()
+
+We call our estimator instance `clf` as it is a classifier. It must now
+be fitted to the data, that is, it must *learn* from the data. This is
+done by passing our training set to the ``fit`` method. As a training
+set, let us use all the images of our dataset apart from the last
+one:
+
+>>> clf.fit(digits.data[:-1], digits.target[:-1]) #doctest: +ELLIPSIS
+<scikits.learn.svm.SVC object at 0x...>
+
+Now you can predict new values. In particular, we can ask the
+classifier which digit is shown in the last image of the `digits` dataset,
+which we have not used to train the classifier:
+
+>>> clf.predict(digits.data[-1])
+array([ 8.])
+
+The corresponding image is the following:
+
+.. image:: images/last_digit.png
+    :align: center
+
+As you can see, it is a challenging task: the images are of poor
+resolution. Do you agree with the classifier?
+
+A complete example of this classification problem is available as an
+example that you can run and study:
+:ref:`example_plot_digits_classification.py`.
--
GitLab
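
The tutorial text added by this patch describes an estimator as a plain Python class that implements `fit(X, Y)` and `predict(T)`, trained on every sample apart from the last and then asked about that last sample. The following is a minimal sketch of that interface only. `NearestNeighborClassifier` and the tiny `data`/`target` lists are hypothetical, made up for illustration, and are not part of `scikit.learn`; the sketch uses a nearest-neighbour rule rather than the `SVC` used in the tutorial.

```python
# Toy illustration of the fit/predict estimator interface.
# NearestNeighborClassifier is a hypothetical class written for this sketch;
# it is NOT part of scikit.learn and does not reproduce svm.SVC.


class NearestNeighborClassifier(object):
    """Predict the label of the closest training sample."""

    def fit(self, X, Y):
        # "Learning" here is simply memorising the training set.
        self.X_ = [list(x) for x in X]
        self.Y_ = list(Y)
        return self

    def predict(self, T):
        # For each new sample, return the label of its nearest training
        # sample, using the squared Euclidean distance.
        predictions = []
        for t in T:
            distances = [sum((a - b) ** 2 for a, b in zip(x, t))
                         for x in self.X_]
            predictions.append(self.Y_[distances.index(min(distances))])
        return predictions


# Made-up data laid out as (n_samples, n_features), with one label per sample.
data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]]
target = [0, 0, 1, 1, 0]

# Training set: every sample apart from the last; testing set: the last one.
clf = NearestNeighborClassifier().fit(data[:-1], target[:-1])
prediction = clf.predict(data[-1:])
print(prediction)
```

Unlike the doctest in the patch, which passes a single sample to `predict`, this sketch passes a list containing one sample (`data[-1:]`), so `predict` always receives a 2D, `n_samples, n_features` shaped input.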