Commit ea752cd6 authored by He Chen, committed by Joel Nothman

[MRG + 1] Added some documentation for loading external datasets (Issue 3808) (#7516)

* Update tutorial.rst

* Update tutorial.rst

* Update tutorial.rst

* Update tutorial.rst

* Update index.rst

* Update index.rst

* Update tutorial.rst

* Update tutorial.rst

* Update tutorial.rst

* Update faq.rst

* Update faq.rst

* Divided in two cases (standard columnar and misc data)

I also added a preprocessing note at the end

* Update tutorial.rst

* Update faq.rst

* Update index.rst

Also added some references that were in the original FAQ and pointed the FAQ to here

* Update index.rst

Added the information from the removed part of the FAQ because I felt that the FAQ version was better than the explanation I gave.

* Update index.rst

reference to skimage and also has sklearn.preprocessing.OneHotEncoder instead of OneHotEncoder

* Update index.rst

* Update index.rst

Changed with @jnothman's feedback 
https://github.com/scikit-learn/scikit-learn/pull/7516/files/d16ac523ed404188fc1f2529ac89050d4a974e3f

* Update faq.rst

* optimized file formats added to datasets/index.rst

Note: if you manage your own numerical data it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provides a Python interface for reading and writing data in that format.

- From the FAQ

* faq.rst: Moved the comment in bunch section to datasets index

This comment has been moved to the datasets index in the external_datasets section:
Note: if you manage your own numerical data it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provides a Python interface for reading and writing data in that format.

* Update index.rst

Included all changes mentioned by @amueller and @jnothman

* Update faq.rst

* Update faq.rst
parent 689a188d
@@ -254,6 +254,58 @@ features::
_`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader

.. _external_datasets:

Loading from external datasets
==============================

scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
matrices. Other types that are convertible to numeric arrays such as pandas
DataFrame are also acceptable.

Here are some recommended ways to load standard columnar data into a
format usable by scikit-learn:

* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_
  provides tools to read data from common formats including CSV, Excel, JSON
  and SQL. DataFrames may also be constructed from lists of tuples or dicts.
  Pandas handles heterogeneous data smoothly and provides tools for
  manipulation and conversion into a numeric array suitable for scikit-learn.
* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_
  specializes in binary formats often used in scientific computing
  contexts, such as .mat and .arff
* `numpy/routines.io <http://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
  for standard loading of columnar data into numpy arrays
* scikit-learn's :func:`datasets.load_svmlight_file` for the svmlight or libSVM
  sparse format
* scikit-learn's :func:`datasets.load_files` for directories of text files where
  the name of each directory is the name of each category and each file inside
  of each directory corresponds to one sample from that category

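
As a minimal sketch of the pandas route listed above, the following reads a
small CSV and converts it into a feature array ``X`` and a target array ``y``.
The column names and the in-memory CSV text are hypothetical, chosen only for
illustration; in practice a file path would be passed to ``pandas.read_csv``.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = io.StringIO(
    "sepal_length,sepal_width,label\n"
    "5.1,3.5,0\n"
    "4.9,3.0,0\n"
    "6.3,3.3,1\n"
)

df = pd.read_csv(csv_text)

# X holds samples as rows and features as columns; y is a 1D target array.
X = df[["sepal_length", "sepal_width"]].to_numpy(dtype=np.float64)
y = df["label"].to_numpy()

print(X.shape, y.shape)
```

The resulting ``X`` and ``y`` can be passed to an estimator's ``fit`` method.
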

For some miscellaneous data such as images, videos, and audio, you may wish to
refer to:

* `skimage.io <http://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
  `Imageio <http://imageio.readthedocs.io/en/latest/userapi.html>`_
  for loading images and videos to numpy arrays
* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imread.html#scipy.misc.imread>`_
  (requires the `Pillow <https://pypi.python.org/pypi/Pillow>`_ package)
  to load pixel intensities data from various image file formats
* `scipy.io.wavfile.read
  <http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html>`_
  for reading WAV files into a numpy array

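
For audio, a rough sketch of the ``scipy.io.wavfile`` route might look like
the following. To stay self-contained it writes a synthetic tone into an
in-memory buffer first; the tone, the sample rate and the buffer are
assumptions for illustration, and a path to a real WAV file could be used
instead.

```python
import io

import numpy as np
from scipy.io import wavfile

# Synthesize one second of a 440 Hz tone as 16-bit samples (illustrative data).
rate = 8000
t = np.arange(rate) / rate
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

# Write the tone to an in-memory WAV "file", then rewind it for reading.
buffer = io.BytesIO()
wavfile.write(buffer, rate, tone)
buffer.seek(0)

# read() returns the sample rate and the samples as a numpy array.
loaded_rate, samples = wavfile.read(buffer)
print(loaded_rate, samples.dtype, samples.shape)
```
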
Categorical (or nominal) features stored as strings (common in pandas DataFrames)
will need converting to integers, and integer categorical variables may be best
exploited when encoded as one-hot variables
(:class:`sklearn.preprocessing.OneHotEncoder`) or similar.
See :ref:`preprocessing`.
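
One possible way to follow the two steps described above (strings to integers,
then integers to one-hot columns) is sketched below. The ``color`` column is a
hypothetical example, and using pandas category codes for the integer step is
just one option among several.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature stored as strings.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Step 1: map each string to an integer code (categories are sorted, so
# blue -> 0, green -> 1, red -> 2). The encoder expects a 2D array.
codes = df["color"].astype("category").cat.codes.to_numpy().reshape(-1, 1)

# Step 2: expand the integer codes into one-hot columns. The encoder returns
# a scipy sparse matrix by default; toarray() makes it dense for display.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(codes).toarray()

print(one_hot)
```
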

Note: if you manage your own numerical data it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.
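
A minimal sketch of the HDF5 approach, assuming the h5py package is installed,
is shown below. The dataset names ``"X"`` and ``"y"`` and the temporary file
path are arbitrary choices for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative numeric data: 4 samples with 3 features, plus integer targets.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
y = np.array([0, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.h5")

    # Store both arrays as named datasets in a single HDF5 file.
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=X)
        f.create_dataset("y", data=y)

    # Read them back; slicing with [:] loads the full array into memory.
    with h5py.File(path, "r") as f:
        X_loaded = f["X"][:]
        y_loaded = f["y"][:]

print(X_loaded.shape, y_loaded.shape)
```
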
.. make sure everything is in a toc tree
......
@@ -75,31 +75,15 @@ input variables and a 1D array ``y`` for the target variables. The array ``X``
 holds the features as columns and samples as rows . The array ``y`` contains
 integer values to encode the class membership of each sample in ``X``.
-To load data as numpy arrays you can use different libraries depending on the
-original data format:
-* `numpy.loadtxt
-  <http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html>`_ to
-  load text files (such as CSV) assuming that all the columns have an
-  homogeneous data type (e.g. all numeric values).
-
-* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_ for common
-  binary formats often used in scientific computing context.
-
-* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.
-  misc.imread.html#scipy.misc.imread>`_ (requires the `Pillow
-  <https://pypi.python.org/pypi/Pillow>`_ package) to load pixel intensities
-  data from various image file formats.
-
-* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_ to load
-  heterogeneously typed data from various file formats and database protocols
-  that can slice and dice before conversion to numerical features in a numpy
-  array.
-
-Note: if you manage your own numerical data it is recommended to use an
-optimized file format such as HDF5 to reduce data load times. Various libraries
-such as H5Py, PyTables and pandas provides a Python interface for reading and
-writing data in that format.
+How can I load my own datasets into a format usable by scikit-learn?
+--------------------------------------------------------------------
+Generally, scikit-learn works on any numeric data stored as numpy arrays
+or scipy sparse matrices. Other types that are convertible to numeric
+arrays such as pandas DataFrame are also acceptable.
+
+For more information on loading your data files into these usable data
+structures, please refer to :ref:`loading external datasets <external_datasets>`.
 What are the inclusion criteria for new algorithms ?
 ----------------------------------------------------
......
@@ -137,6 +137,9 @@ learn::
from the original problem one can shape the data for consumption in
scikit-learn.

.. topic:: Loading from external datasets

    To load from an external dataset, please refer to :ref:`loading external datasets <external_datasets>`.

Learning and predicting
------------------------
......