From 03436275136d59296247ba51397b4a9e7e66d94d Mon Sep 17 00:00:00 2001
From: Andreas Mueller <amueller@ais.uni-bonn.de>
Date: Sat, 17 Dec 2011 23:41:43 +0100
Subject: [PATCH] DOC Description of the basic dataset API

---
 doc/datasets/index.rst | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/doc/datasets/index.rst b/doc/datasets/index.rst
index 992b8eb9bf..f44ce2de9c 100644
--- a/doc/datasets/index.rst
+++ b/doc/datasets/index.rst
@@ -26,6 +26,28 @@ This package also features helpers to fetch larger datasets commonly
 used by the machine learning community to benchmark algorithm on data
 that comes from the 'real world'.
 
+General dataset API
+===================
+There are three distinct kinds of dataset interfaces used at the moment.
+The simplest one is the interface for sample images, which is described
+below in the :ref:`sample_images` section.
+
+The dataset generation functions and the svmlight loader share a simple
+interface, returning a tuple ``(X, y)`` consisting of an ``n_samples x n_features``
+numpy array ``X`` and an array of length ``n_samples`` containing the targets ``y``.
+
+The toy datasets as well as the 'real world' datasets and the datasets
+fetched from mldata.org have a more sophisticated structure.
+These functions return a ``Bunch`` (a dictionary whose keys can also be
+accessed as attributes, using the ``bunch.key`` syntax).
+All datasets have at least two keys: ``data``, containing an array of shape
+``n_samples x n_features``, and ``target``, a numpy array of length ``n_samples``
+containing the targets.
+The datasets also contain a description in ``DESCR`` and some contain
+``feature_names`` and ``target_names``.
+See the dataset descriptions below for details.
+
+
 Toy datasets
 ============
 
-- 
GitLab