diff --git a/doc/datasets/index.rst b/doc/datasets/index.rst
index 2e36c85b659da5dae9ff11c5b5607386782e21b7..09abd44bcebcacdc5c4f2694a9586f16d590226b 100644
--- a/doc/datasets/index.rst
+++ b/doc/datasets/index.rst
@@ -27,15 +27,16 @@ The simplest one is the interface for sample images, which is described below
 in the :ref:`sample_images` section.
 
 The dataset generation functions and the svmlight loader share a simplistic
-interface, returning a tuple ``(X, y)`` consisting of a n_samples x n_features
-numpy array X and an array of length n_samples containing the targets y.
+interface, returning a tuple ``(X, y)`` consisting of a ``n_samples`` *
+``n_features`` numpy array ``X`` and an array of length ``n_samples``
+containing the targets ``y``.
 
 The toy datasets as well as the 'real world' datasets and the datasets
 fetched from mldata.org have more sophisticated structure.
 These functions return a dictionary-like object holding at least two items:
-an array of shape ``n_samples`` * `` n_features`` with key ``data``
+an array of shape ``n_samples`` * ``n_features`` with key ``data``
 (except for 20newsgroups)
-and a NumPy array of length ``n_samples``, containing the target values,
+and a numpy array of length ``n_samples``, containing the target values,
 with key ``target``.
 
 The datasets also contain a description in ``DESCR`` and some contain
diff --git a/doc/datasets/labeled_faces.rst b/doc/datasets/labeled_faces.rst
index 8786335e6fec02f79f10b955fb6e13b74421eaf2..85a2b41a6475bbbaa742e757a745e7cf216d21d6 100644
--- a/doc/datasets/labeled_faces.rst
+++ b/doc/datasets/labeled_faces.rst
@@ -92,15 +92,16 @@ is a pair of two picture belonging or not to the same person::
 
     >>> lfw_pairs_train.target.shape
     (2200,)
 
-Both for the ``fetch_lfw_people`` and ``fetch_lfw_pairs`` function it is
+Both for the :func:`sklearn.datasets.fetch_lfw_people` and
+:func:`sklearn.datasets.fetch_lfw_pairs` functions it is
 possible to get an additional dimension with the RGB color channels by
 passing ``color=True``, in that case the shape will be
 ``(2200, 2, 62, 47, 3)``.
 
-The ``fetch_lfw_pairs`` datasets is subdivided into 3 subsets: the development
-``train`` set, the development ``test`` set and an evaluation ``10_folds``
-set meant to compute performance metrics using a 10-folds cross
-validation scheme.
+The :func:`sklearn.datasets.fetch_lfw_pairs` dataset is subdivided into
+3 subsets: the development ``train`` set, the development ``test`` set and
+an evaluation ``10_folds`` set meant to compute performance metrics using a
+10-fold cross-validation scheme.
 
 .. topic:: References:
diff --git a/doc/datasets/mldata.rst b/doc/datasets/mldata.rst
index 5620a43df3ee5ceb287729566edea8e15fb655e7..5083317cffc5300a459bd6df8b0fd56a62255b4f 100644
--- a/doc/datasets/mldata.rst
+++ b/doc/datasets/mldata.rst
@@ -13,7 +13,8 @@ Downloading datasets from the mldata.org repository
 data, supported by the `PASCAL network <http://www.pascal-network.org>`_ .
 
 The ``sklearn.datasets`` package is able to directly download data
-sets from the repository using the function ``fetch_mldata(dataname)``.
+sets from the repository using the function
+:func:`sklearn.datasets.fetch_mldata`.
 
 For example, to download the MNIST digit recognition database::
 
@@ -38,14 +39,15 @@ specified by the ``data_home`` keyword argument, which defaults to
     ['mnist-original.mat']
 
 Data sets in `mldata.org <http://mldata.org>`_ do not adhere to a strict
-naming or formatting convention. ``fetch_mldata`` is able to make sense
-of the most common cases, but allows to tailor the defaults to individual
-datasets:
+naming or formatting convention. :func:`sklearn.datasets.fetch_mldata` is
+able to make sense of the most common cases, but allows you to tailor the
+defaults to individual datasets:
 
 * The data arrays in `mldata.org <http://mldata.org>`_ are most often
   shaped as ``(n_features, n_samples)``. This is the opposite of the
-  ``scikit-learn`` convention, so ``fetch_mldata`` transposes the matrix
-  by default. The ``transpose_data`` keyword controls this behavior::
+  ``scikit-learn`` convention, so :func:`sklearn.datasets.fetch_mldata`
+  transposes the matrix by default. The ``transpose_data`` keyword controls
+  this behavior::
 
     >>> iris = fetch_mldata('iris', data_home=custom_data_home)
     >>> iris.data.shape
@@ -55,12 +57,12 @@ datasets:
     >>> iris.data.shape
     (4, 150)
 
-* For datasets with multiple columns, ``fetch_mldata`` tries to identify
-  the target and data columns and rename them to ``target`` and ``data``.
-  This is done by looking for arrays named ``label`` and ``data`` in the
-  dataset, and failing that by choosing the first array to be ``target``
-  and the second to be ``data``. This behavior can be changed with the
-  ``target_name`` and ``data_name`` keywords, setting them to a specific
+* For datasets with multiple columns, :func:`sklearn.datasets.fetch_mldata`
+  tries to identify the target and data columns and rename them to ``target``
+  and ``data``. This is done by looking for arrays named ``label`` and
+  ``data`` in the dataset, and failing that by choosing the first array to be
+  ``target`` and the second to be ``data``. This behavior can be changed with
+  the ``target_name`` and ``data_name`` keywords, setting them to a specific
   name or index number (the name and order of the columns in the datasets
   can be found at its `mldata.org <http://mldata.org>`_ under the tab "Data"::
diff --git a/doc/datasets/twenty_newsgroups.rst b/doc/datasets/twenty_newsgroups.rst
index 003366efa4606e532587e7bb129fb21973cb7e1d..593f35978017d5c31160d0b76d26e8dc5486a080 100644
--- a/doc/datasets/twenty_newsgroups.rst
+++ b/doc/datasets/twenty_newsgroups.rst
@@ -10,22 +10,22 @@ between the train and test set is based upon a messages posted before
 and after a specific date.
 
 This module contains two loaders. The first one,
-``sklearn.datasets.fetch_20newsgroups``,
+:func:`sklearn.datasets.fetch_20newsgroups`,
 returns a list of the raw texts that can be fed to text feature
 extractors such as :class:`sklearn.feature_extraction.text.Vectorizer`
 with custom parameters so as to extract feature vectors.
-The second one, ``sklearn.datasets.fetch_20newsgroups_vectorized``,
+The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
 returns ready-to-use features, i.e., it is not necessary to use a feature
 extractor.
 
 Usage
 -----
 
-The ``sklearn.datasets.fetch_20newsgroups`` function is a data
+The :func:`sklearn.datasets.fetch_20newsgroups` function is a data
 fetching / caching functions that downloads the data archive from
 the original `20 newsgroups website`_, extracts the archive contents
 in the ``~/scikit_learn_data/20news_home`` folder and calls the
-``sklearn.datasets.load_file`` on either the training or
+:func:`sklearn.datasets.load_files` on either the training or
 testing set folder, or both of them::
 
     >>> from sklearn.datasets import fetch_20newsgroups
@@ -65,7 +65,8 @@ attribute is the integer index of the category::
     array([12, 6, 9, 8, 6, 7, 9, 2, 13, 19])
 
 It is possible to load only a sub-selection of the categories by passing the
-list of the categories to load to the ``fetch_20newsgroups`` function::
+list of the categories to load to the
+:func:`sklearn.datasets.fetch_20newsgroups` function::
 
     >>> cats = ['alt.atheism', 'sci.space']
     >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
@@ -106,7 +107,7 @@ components by sample in a more than 30000-dimensional space
     >>> vectors.nnz / float(vectors.shape[0])
     159.01327433628319
 
-``sklearn.datasets.fetch_20newsgroups_vectorized`` is a function which returns
+:func:`sklearn.datasets.fetch_20newsgroups_vectorized` is a function which returns
 ready-to-use tfidf features instead of file names.
 
 .. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/
@@ -147,7 +148,7 @@ Let's take a look at what the most informative features are:
     ...     for i, category in enumerate(categories):
    ...         top10 = np.argsort(classifier.coef_[i])[-10:]
     ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
-    ... 
+    ...
     >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
     alt.atheism: sgi livesey atheists writes people caltech com god keith edu
     comp.graphics: organization thanks files subject com image lines university edu graphics
@@ -176,7 +177,7 @@ of each file.
 **remove** should be a tuple containing any subset of
 ``('headers', 'footers', 'quotes')``, telling it to remove headers, signature
 blocks, and quotation blocks respectively.
-    >>> newsgroups_test = fetch_20newsgroups(subset='test', 
+    >>> newsgroups_test = fetch_20newsgroups(subset='test',
     ...                                      remove=('headers', 'footers', 'quotes'),
     ...                                      categories=categories)
     >>> vectors_test = vectorizer.transform(newsgroups_test.data)
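
A note for reviewers: the two return conventions described in the ``index.rst`` hunk above can be sanity-checked without network access. This is a minimal sketch using ``make_classification`` and ``load_iris`` as stand-ins for the generation functions and the toy-dataset loaders (the specific parameters are illustrative, not part of this patch):

```python
# Sketch of the two return conventions documented in doc/datasets/index.rst.
from sklearn.datasets import load_iris, make_classification

# Dataset generation functions (and the svmlight loader) return a plain
# (X, y) tuple: X has shape (n_samples, n_features), y has length n_samples.
X, y = make_classification(n_samples=150, n_features=4, random_state=0)
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)

# Toy and 'real world' loaders instead return a dictionary-like object
# holding at least the ``data`` and ``target`` keys, plus a ``DESCR``
# description.
iris = load_iris()
print(iris.data.shape, iris.target.shape)  # (150, 4) (150,)
print("DESCR" in iris)  # True
```

Running this confirms the shapes and keys that the rewritten paragraphs promise, so the cross-references introduced by this patch stay consistent with the actual API.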