    Dataset loading utilities

    The sklearn.datasets package embeds some small toy datasets as introduced in the :ref:`Getting Started <loading_example_dataset>` section.

    To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data.

    This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the 'real world'.

    General dataset API

    There are three distinct kinds of dataset interfaces for different types of datasets. The simplest one is the interface for sample images, which is described below in the :ref:`sample_images` section.

    The dataset generation functions and the svmlight loader share a simple interface, returning a tuple (X, y) consisting of an n_samples x n_features numpy array X and an array of length n_samples containing the targets y.
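
    For illustration, a call to one of the generators (make_blobs here, with illustrative parameter values) returns such a pair:

    >>> from sklearn.datasets import make_blobs
    >>> X, y = make_blobs(n_samples=50, n_features=2, random_state=0)
    >>> X.shape
    (50, 2)
    >>> y.shape
    (50,)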

    The toy datasets, the 'real world' datasets, and the datasets fetched from mldata.org have a more sophisticated structure. These functions return a bunch, which is a dictionary whose keys are also accessible as attributes with the bunch.key syntax. All datasets have at least two keys: data, containing an array of shape n_samples x n_features (except for 20newsgroups), and target, a numpy array of length n_samples containing the targets.

    The datasets also contain a description in DESCR and some contain feature_names and target_names. See the dataset descriptions below for details.
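
    For example, a short sketch with the iris toy dataset shows how the bunch keys are accessed:

    >>> from sklearn.datasets import load_iris
    >>> iris = load_iris()
    >>> iris.data.shape
    (150, 4)
    >>> iris.target.shape
    (150,)
    >>> iris.target_names.shape
    (3,)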

    Toy datasets

    scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.

    These datasets are useful for quickly illustrating the behavior of the various algorithms implemented in scikit-learn. They are however often too small to be representative of real world machine learning tasks.
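
    As a quick sketch, the digits toy dataset loads without any network access (the shapes below are those documented for load_digits):

    >>> from sklearn.datasets import load_digits
    >>> digits = load_digits()
    >>> digits.data.shape
    (1797, 64)
    >>> digits.images.shape
    (1797, 8, 8)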

    Sample images

    scikit-learn also embeds a couple of sample JPEG images published under Creative Commons license by their authors. Those images can be useful to test algorithms and pipelines on 2D data.

    .. image:: ../auto_examples/cluster/images/plot_color_quantization_1.png

    Warning

    The default encoding of images is based on the uint8 dtype to spare memory. Machine learning algorithms often work best if the input is converted to a floating point representation first. Also, if you plan to use pylab.imshow, don't forget to scale to the range 0 - 1 as done in the following example.
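
    A minimal sketch of such a conversion, assuming the bundled china.jpg image is available through load_sample_image:

    >>> import numpy as np
    >>> from sklearn.datasets import load_sample_image
    >>> china = load_sample_image("china.jpg")             # doctest: +SKIP
    >>> china.dtype                                        # doctest: +SKIP
    dtype('uint8')
    >>> # convert to floats in the 0 - 1 range for pylab.imshow
    >>> china = np.asarray(china, dtype=np.float64) / 255  # doctest: +SKIP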

    Examples:

    • :ref:`example_cluster_plot_color_quantization.py`

    Sample generators

    In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity.

    .. image:: ../auto_examples/images/plot_random_dataset_1.png
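
    For instance, make_classification (the parameter values here are only illustrative) builds a labeled dataset with a chosen number of informative and redundant features:

    >>> from sklearn.datasets import make_classification
    >>> X, y = make_classification(n_samples=100, n_features=20,
    ...                            n_informative=2, n_redundant=2,
    ...                            random_state=0)
    >>> X.shape
    (100, 20)
    >>> y.shape
    (100,)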

    Datasets in svmlight / libsvm format

    scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> .... This format is especially suitable for sparse datasets. In this module, scipy sparse CSR matrices are used for X and numpy arrays are used for y.

    You may load a dataset as follows:

    >>> from sklearn.datasets import load_svmlight_file
    >>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
    ...                                                         # doctest: +SKIP

    You may also load two (or more) datasets at once:

    >>> from sklearn.datasets import load_svmlight_files
    >>> X_train, y_train, X_test, y_test = load_svmlight_files(
    ...     ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
    ...                                                         # doctest: +SKIP

    In this case, X_train and X_test are guaranteed to have the same number of features. Another way to achieve the same result is to fix the number of features:

    >>> X_test, y_test = load_svmlight_file(
    ...     "/path/to/test_dataset.txt", n_features=X_train.shape[1])
    ...                                                         # doctest: +SKIP

    Related links:

    • Public datasets in svmlight / libsvm format: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

    • Faster API-compatible implementation: https://github.com/mblondel/svmlight-loader