[MRG + 1] Added some documentation for loading external datasets (Issue 3808) (#7516)

* Update tutorial.rst
* Update tutorial.rst
* Update tutorial.rst
* Update tutorial.rst
* Update index.rst
* Update index.rst
* Update tutorial.rst
* Update tutorial.rst
* Update tutorial.rst
* Update faq.rst
* Update faq.rst
* Divided in two cases (standard columnar and misc data)
  I also added a preprocessing note at the end
* Update tutorial.rst
* Update faq.rst
* Update index.rst
  Also added some references that were in the original FAQ and pointed the FAQ to here
* Update index.rst
  Added the information from the removed part of the FAQ because I felt that the FAQ version was better than the explanation I gave.
* Update index.rst
  reference to skimage and also has sklearn.preprocessing.OneHotEncoder instead of OneHotEncoder
* Update index.rst
* Update index.rst
  Changed with @jnothman's feedback https://github.com/scikit-learn/scikit-learn/pull/7516/files/d16ac523ed404188fc1f2529ac89050d4a974e3f
* Update faq.rst
* optimized file formats added to datasets/index.rst
  Note: if you manage your own numerical data it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provides a Python interface for reading and writing data in that format. - From the FAQ
* faq.rst: Moved the comment in bunch section to datasets index
  This comment has been moved to the datasets index in the external_datasets section: Note: if you manage your own numerical data it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provides a Python interface for reading and writing data in that format.
* Update index.rst
  Included all changes mentioned by @amueller and @jnothman
* Update faq.rst
* Update faq.rst

Commit ea752cd6 authored 8 years ago by He Chen Committed by Joel Nothman 8 years ago
--- a/doc/datasets/index.rst
+++ b/doc/datasets/index.rst
@@ -254,6 +254,58 @@ features::
 _`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader
+.. _external_datasets:
+Loading from external datasets
+==============================
+scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
+matrices. Other types that are convertible to numeric arrays such as pandas
+DataFrame are also acceptable.
+Here are some recommended ways to load standard columnar data into a 
+format usable by scikit-learn: 
+* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_ 
+  provides tools to read data from common formats including CSV, Excel, JSON
+  and SQL. DataFrames may also be constructed from lists of tuples or dicts.
+  Pandas handles heterogeneous data smoothly and provides tools for
+  manipulation and conversion into a numeric array suitable for scikit-learn.
+* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_ 
+  specializes in binary formats often used in scientific computing 
+  contexts, such as .mat and .arff
+* `numpy/routines.io <http://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
+  for standard loading of columnar data into numpy arrays
+* scikit-learn's :func:`datasets.load_svmlight_file` for the svmlight or libSVM
+  sparse format
+* scikit-learn's :func:`datasets.load_files` for directories of text files where
+  the name of each directory is the name of each category and each file inside
+  of each directory corresponds to one sample from that category
+For some miscellaneous data such as images, videos, and audio, you may wish to
+refer to:
+* `skimage.io <http://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
+  `Imageio <http://imageio.readthedocs.io/en/latest/userapi.html>`_ 
+  for loading images and videos to numpy arrays
+* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.
+  misc.imread.html#scipy.misc.imread>`_ (requires the `Pillow
+  <https://pypi.python.org/pypi/Pillow>`_ package) to load pixel intensities
+  data from various image file formats
+* `scipy.io.wavfile.read 
+  <http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html>`_ 
+  for reading WAV files into a numpy array
+Categorical (or nominal) features stored as strings (common in pandas DataFrames) 
+will need converting to integers, and integer categorical variables may be best 
+exploited when encoded as one-hot variables 
+(:class:`sklearn.preprocessing.OneHotEncoder`) or similar. 
+See :ref:`preprocessing`.
+Note: if you manage your own numerical data it is recommended to use an 
+optimized file format such as HDF5 to reduce data load times. Various libraries
+such as H5Py, PyTables and pandas provide a Python interface for reading and 
+writing data in that format.
 .. make sure everything is in a toc tree
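
As an illustration of the new ``external_datasets`` section in the hunk above, here is a minimal sketch of loading columnar data with pandas and turning it into a numeric array. It is not part of the commit. The CSV content and the column names (``length``, ``width``, ``color``, ``label``) are hypothetical, and the sketch assumes a scikit-learn version whose ``OneHotEncoder`` accepts string categories directly (0.20 or later) plus a pandas version that provides ``DataFrame.to_numpy``.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical CSV content standing in for a file on disk.
csv_text = io.StringIO(
    "length,width,color,label\n"
    "1.0,2.0,red,0\n"
    "1.5,1.8,blue,1\n"
    "3.2,0.4,red,1\n"
)
df = pd.read_csv(csv_text)

# One-hot encode the string-valued categorical column.
# fit_transform returns a sparse matrix here; toarray() makes it dense.
encoder = OneHotEncoder()
color_onehot = encoder.fit_transform(df[["color"]]).toarray()

# Stack the numeric columns next to the encoded ones to get the feature matrix.
X = np.hstack([df[["length", "width"]].to_numpy(), color_onehot])
y = df["label"].to_numpy()
```

``X`` and ``y`` can then be passed to an estimator's ``fit`` method. For larger pipelines, a column-wise transformer may be a tidier way to mix numeric and categorical columns.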

--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -75,31 +75,15 @@ input variables and a 1D array ``y`` for the target variables. The array ``X``
 holds the features as columns and samples as rows. The array ``y`` contains
 integer values to encode the class membership of each sample in ``X``.
-To load data as numpy arrays you can use different libraries depending on the
-original data format:
-* `numpy.loadtxt
-  <http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html>`_ to
-  load text files (such as CSV) assuming that all the columns have an
-  homogeneous data type (e.g. all numeric values).
-* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_ for common
-  binary formats often used in scientific computing context.
-* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.
-  misc.imread.html#scipy.misc.imread>`_ (requires the `Pillow
-  <https://pypi.python.org/pypi/Pillow>`_ package) to load pixel intensities
-  data from various image file formats.
-* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_ to load
-  heterogeneously typed data from various file formats and database protocols
-  that can slice and dice before conversion to numerical features in a numpy
-  array.
-Note: if you manage your own numerical data it is recommended to use an
-optimized file format such as HDF5 to reduce data load times. Various libraries
-such as H5Py, PyTables and pandas provides a Python interface for reading and
-writing data in that format.
+How can I load my own datasets into a format usable by scikit-learn?
+--------------------------------------------------------------------
+Generally, scikit-learn works on any numeric data stored as numpy arrays
+or scipy sparse matrices. Other types that are convertible to numeric 
+arrays such as pandas DataFrame are also acceptable.
+For more information on loading your data files into these usable data 
+structures, please refer to :ref:`loading external datasets <external_datasets>`.
 What are the inclusion criteria for new algorithms ?
 ----------------------------------------------------
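
The FAQ answer added in this hunk says that numpy arrays, scipy sparse matrices and pandas DataFrames are all acceptable inputs. The sketch below, which is not part of the commit, fits the same estimator on each container. The toy data, the column names ``a`` and ``b``, and the choice of ``LogisticRegression`` are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data: four samples (rows), two features (columns).
X_array = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
y = np.array([0, 1, 0, 1])

# The same values held in a scipy sparse matrix and in a pandas DataFrame.
X_sparse = sparse.csr_matrix(X_array)
X_frame = pd.DataFrame(X_array, columns=["a", "b"])

# Fit a fresh model on each container and record its training predictions.
predictions = {}
for name, X in [("array", X_array), ("sparse", X_sparse), ("frame", X_frame)]:
    model = LogisticRegression().fit(X, y)
    predictions[name] = model.predict(X).tolist()
```

All three fits should give the same predictions on this data, since only the container differs.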

--- a/doc/tutorial/basic/tutorial.rst
+++ b/doc/tutorial/basic/tutorial.rst
@@ -137,6 +137,9 @@ learn::
    from the original problem one can shape the data for consumption in
    scikit-learn.
+.. topic:: Loading from external datasets
+    To load from an external dataset, please refer to :ref:`loading external datasets <external_datasets>`.
 Learning and predicting
 ------------------------