diff --git a/doc/datasets/index.rst b/doc/datasets/index.rst
index b3f329e943cdb79657065d751e1baadbc8dca337..c624fdb55f2e5e8321c41369a64c6dfae31c6014 100644
--- a/doc/datasets/index.rst
+++ b/doc/datasets/index.rst
@@ -254,6 +254,58 @@ features::
 
 _`Faster API-compatible implementation`: https://github.com/mblondel/svmlight-loader
 
+.. _external_datasets:
+
+Loading from external datasets
+==============================
+
+scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
+matrices. Other types that are convertible to numeric arrays such as pandas
+DataFrame are also acceptable.
+
+Here are some recommended ways to load standard columnar data into a
+format usable by scikit-learn:
+
+* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_
+  provides tools to read data from common formats including CSV, Excel, JSON
+  and SQL. DataFrames may also be constructed from lists of tuples or dicts.
+  Pandas handles heterogeneous data smoothly and provides tools for
+  manipulation and conversion into a numeric array suitable for scikit-learn.
+* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_
+  specializes in binary formats often used in scientific computing
+  context such as .mat and .arff
+* `numpy/routines.io <http://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
+  for standard loading of columnar data into numpy arrays
+* scikit-learn's :func:`datasets.load_svmlight_file` for the svmlight or libSVM
+  sparse format
+* scikit-learn's :func:`datasets.load_files` for directories of text files where
+  the name of each directory is the name of each category and each file inside
+  of each directory corresponds to one sample from that category
+
+For some miscellaneous data such as images, videos, and audio, you may wish to
+refer to:
+
+* `skimage.io <http://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
+  `Imageio <http://imageio.readthedocs.io/en/latest/userapi.html>`_
+  for loading images and videos to numpy arrays
+* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.
+  misc.imread.html#scipy.misc.imread>`_ (requires the `Pillow
+  <https://pypi.python.org/pypi/Pillow>`_ package) to load pixel intensities
+  data from various image file formats
+* `scipy.io.wavfile.read
+  <http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html>`_
+  for reading WAV files into a numpy array
+
+Categorical (or nominal) features stored as strings (common in pandas DataFrames)
+will need converting to integers, and integer categorical variables may be best
+exploited when encoded as one-hot variables
+(:class:`sklearn.preprocessing.OneHotEncoder`) or similar.
+See :ref:`preprocessing`.
+
+Note: if you manage your own numerical data it is recommended to use an
+optimized file format such as HDF5 to reduce data load times. Various libraries
+such as H5Py, PyTables and pandas provide a Python interface for reading and
+writing data in that format.
 
 .. make sure everything is in a toc tree
 
diff --git a/doc/faq.rst b/doc/faq.rst
index 16101bc5c9ba7ed767350c7b4c9de56c11677828..7a4a2f2a8fd4a8be0dd1f6f6e039bde1d6b6e12a 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -75,31 +75,15 @@ input variables and a 1D array ``y`` for the target variables. The array ``X``
 holds the features as columns and samples as rows . The array ``y`` contains
 integer values to encode the class membership of each sample in ``X``.
 
-To load data as numpy arrays you can use different libraries depending on the
-original data format:
-
-* `numpy.loadtxt
-  <http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html>`_ to
-  load text files (such as CSV) assuming that all the columns have an
-  homogeneous data type (e.g. all numeric values).
-
-* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_ for common
-  binary formats often used in scientific computing context.
-
-* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.
-  misc.imread.html#scipy.misc.imread>`_ (requires the `Pillow
-  <https://pypi.python.org/pypi/Pillow>`_ package) to load pixel intensities
-  data from various image file formats.
-
-* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_ to load
-  heterogeneously typed data from various file formats and database protocols
-  that can slice and dice before conversion to numerical features in a numpy
-  array.
-
-Note: if you manage your own numerical data it is recommended to use an
-optimized file format such as HDF5 to reduce data load times. Various libraries
-such as H5Py, PyTables and pandas provides a Python interface for reading and
-writing data in that format.
+How can I load my own datasets into a format usable by scikit-learn?
+--------------------------------------------------------------------
+
+Generally, scikit-learn works on any numeric data stored as numpy arrays
+or scipy sparse matrices. Other types that are convertible to numeric
+arrays such as pandas DataFrame are also acceptable.
+
+For more information on loading your data files into these usable data
+structures, please refer to :ref:`loading external datasets <external_datasets>`.
 
 What are the inclusion criteria for new algorithms ?
 ----------------------------------------------------
diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst
index 799f05c1148406fbcabca042589337a12515ee03..439343d30c4df8951d0f9bdc993cf803204aad1e 100644
--- a/doc/tutorial/basic/tutorial.rst
+++ b/doc/tutorial/basic/tutorial.rst
@@ -136,7 +136,10 @@ learn::
     <sphx_glr_auto_examples_classification_plot_digits_classification.py>` illustrates how starting
     from the original problem one can shape the data for consumption in
     scikit-learn.
+
+.. topic:: Loading from external datasets
 
+    To load from an external dataset, please refer to :ref:`loading external datasets <external_datasets>`.
 
 Learning and predicting
 ------------------------
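
The loading path described in the new "Loading from external datasets" section can be sketched as follows. This is only an illustration under stated assumptions: the in-memory CSV text, the ``colors`` feature and all variable names are hypothetical, and plain numpy stands in for pandas and :class:`sklearn.preprocessing.OneHotEncoder` so that the sketch stays self-contained. It reads columnar numeric text into an array, splits it into ``X`` and ``y``, then converts a string-valued categorical feature to integer codes and one-hot columns.

```python
import io

import numpy as np

# Hypothetical in-memory CSV standing in for a file on disk:
# two numeric feature columns followed by an integer class label.
csv_text = """5.1,3.5,0
4.9,3.0,0
6.2,2.9,1
5.9,3.0,1
"""

# np.loadtxt parses homogeneous numeric columns into a 2D float array.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Feature matrix X (samples as rows, features as columns) and target y.
X = data[:, :-1]
y = data[:, -1].astype(int)

# Hypothetical categorical feature stored as strings, one value per sample.
colors = np.array(["red", "green", "red", "blue"])

# Map each string to an integer code; categories come back sorted.
categories, codes = np.unique(colors, return_inverse=True)

# Expand the integer codes into one-hot columns (one column per category).
one_hot = np.eye(len(categories))[codes]

# Final numeric array: original features plus the one-hot columns.
X_full = np.hstack([X, one_hot])
```

In practice, ``pandas.read_csv`` and ``OneHotEncoder`` would usually replace the manual parsing and encoding steps; the numpy-only version is used here just to show the shape of the data at each stage.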