diff --git a/doc/developers/index.rst b/doc/developers/index.rst index f818f7a7723bbf5e6a72a87a322914b747975479..80e46ae41bcd0bd71711471f152f9280f4209511 100644 --- a/doc/developers/index.rst +++ b/doc/developers/index.rst @@ -44,7 +44,7 @@ additional utilities. Contributing code ================= -.. note: +.. note:: To avoid duplicated work it is highly advised to contact the developers mailing list before starting work on a non-trivial feature. @@ -103,6 +103,10 @@ rules before submitting a pull request: * Follow the `coding-guidelines`_ (see below). + * When applicable, use the Validation tools and other code in the + ``sklearn.utils`` submodule. A list of utility routines available + for developers can be found in the :ref:`developers-utils` page. + * All public methods should have informative docstrings with sample usage presented as doctests when appropriate. @@ -267,6 +271,7 @@ In addition, we add the following guidelines: <https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt>`_ in all your docstrings. + A good example of code that we like can be found `here <https://svn.enthought.com/enthought/browser/sandbox/docs/coding_standard.py>`_. @@ -286,6 +291,46 @@ In other cases, be sure to call ``safe_asarray``, ``atleast2d_or_csr``, scikit-learn API function. The exact function to use depends mainly on whether ``scipy.sparse`` matrices must be accepted. +For more information, refer to the :ref:`developers-utils` page. + +Random Numbers +-------------- + +If your code depends on a random number generator, do not use +``numpy.random.random()`` or similar routines. To ensure +repeatability in error checking, the routine should accept a keyword +``random_state`` and use this to construct a +``numpy.random.RandomState`` object. +See ``sklearn.utils.check_random_state`` in :ref:`developers-utils`. 
+ +Here's a simple example of code using some of the above guidelines: + +:: + + from sklearn.utils import array2d, check_random_state + + def choose_random_sample(X, random_state=0): + """ + Choose a random point from X + + Parameters + ---------- + X : array-like, shape = (n_samples, n_features) + array representing the data + random_state : RandomState or an int seed (0 by default) + A random number generator instance to define the state of the + random permutations generator. + + Returns + ------- + x : numpy array, shape = (n_features,) + A random point selected from X + """ + X = array2d(X) + random_state = check_random_state(random_state) + i = random_state.randint(X.shape[0]) + return X[i] + APIs of scikit-learn objects ============================ diff --git a/doc/developers/utilities.rst b/doc/developers/utilities.rst new file mode 100644 index 0000000000000000000000000000000000000000..5f4364bcc1fddd880cd8392cae9639dcd32da0c3 --- /dev/null +++ b/doc/developers/utilities.rst @@ -0,0 +1,230 @@ +.. _developers-utils: + +======================== +Utilities for Developers +======================== +Scikit-learn contains a number of utilities to help with development. These +are located in ``sklearn.utils``, and include tools in a number of categories. +All the following functions and classes are in the module ``sklearn.utils``. + +Please note that these utilities are meant to be used internally within +scikit-learn. They are not guaranteed to be stable between versions of +scikit-learn. Backports, in particular, will be removed as the scikit-learn +dependencies evolve. + +.. currentmodule:: sklearn.utils + +Validation Tools +---------------- +These are tools used to check and validate input. When you write a function +which accepts arrays, matrices, or sparse matrices as arguments, the following +should be used when applicable. + +- :func:`assert_all_finite`: Throw an error if array contains NaNs or Infs. 
+ +- :func:`safe_asarray`: Convert input to array or sparse matrix. Equivalent + to ``np.asarray``, but sparse matrices are passed through. + +- :func:`as_float_array`: convert input to an array of floats. If a sparse + matrix is passed, a sparse matrix will be returned. + +- :func:`array2d`: equivalent to ``np.atleast_2d``, but the ``order`` and + ``dtype`` of the input are maintained. + +- :func:`atleast2d_or_csr`: equivalent to ``array2d``, but if a sparse matrix + is passed, it is converted to CSR format. Also calls ``assert_all_finite``. + +- :func:`check_arrays`: check that all input arrays have consistent first + dimensions. This will work for an arbitrary number of arrays. + +- :func:`warn_if_not_float`: Warn if input is not a floating-point array. + The input ``X`` is assumed to have an ``X.dtype`` attribute. + +If your code relies on a random number generator, it should never use +functions like ``numpy.random.random`` or ``numpy.random.normal``. This +approach can lead to repeatability issues in unit tests. Instead, a +``numpy.random.RandomState`` object should be used, which is built from +a ``random_state`` argument passed to the class or function. The function +:func:`check_random_state`, below, can then be used to create a random +number generator object. + +- :func:`check_random_state`: create a ``np.random.RandomState`` object from + a parameter ``random_state``. + + - If ``random_state`` is ``None`` or ``np.random``, then a + randomly-initialized ``RandomState`` object is returned. + - If ``random_state`` is an integer, then it is used to seed a new + ``RandomState`` object. + - If ``random_state`` is a ``RandomState`` object, then it is passed through. 
+ +For example: + + >>> from sklearn.utils import check_random_state + >>> random_state = 0 + >>> random_state = check_random_state(random_state) + >>> random_state.rand(4) + array([ 0.5488135 , 0.71518937, 0.60276338, 0.54488318]) + + +Efficient Linear Algebra & Array Operations +------------------------------------------- + +- :func:`extmath.randomized_range_finder`: construct an orthonormal matrix + whose range approximates the range of the input. This is used in + :func:`extmath.fast_svd`, below. + +- :func:`extmath.fast_svd`: compute the k-truncated randomized SVD. + This algorithm finds an approximate truncated singular value decomposition + using randomization to speed up the computations. It is particularly + fast on large matrices on which you wish to extract only a small + number of components. + +- :func:`arrayfuncs.cholesky_delete`: + (used in :func:`sklearn.linear_model.least_angle.lars_path`) Remove an + item from a Cholesky factorization. + +- :func:`arrayfuncs.min_pos`: (used in ``sklearn.linear_model.least_angle``) + Find the minimum of the positive values within an array. + +- :func:`extmath.norm`: computes vector norm by directly calling the BLAS + ``nrm2`` function. This is more stable than ``scipy.linalg.norm``. See + `Fabian's blog post + <http://fseoane.net/blog/2011/computing-the-vector-norm/>`_ for a discussion. + +- :func:`extmath.fast_logdet`: efficiently compute the log of the determinant + of a matrix. + +- :func:`extmath.density`: efficiently compute the density of a sparse vector. + +- :func:`extmath.safe_sparse_dot`: dot product which will correctly handle + ``scipy.sparse`` inputs. If the inputs are dense, it is equivalent to + ``numpy.dot``. + +- :func:`extmath.logsum`: compute the sum of X assuming X is in the log domain. + This is equivalent to calling ``np.log(np.sum(np.exp(X)))``, but is + robust to overflow/underflow errors. 
+ +- :func:`extmath.weighted_mode`: an extension of ``scipy.stats.mode`` which + allows each item to have a real-valued weight. + +- :func:`resample`: Resample arrays or sparse matrices in a consistent way. + Used in :func:`shuffle`, below. + +- :func:`shuffle`: Shuffle arrays or sparse matrices in a consistent way. + Used in ``sklearn.cluster.k_means``. + +Graph Routines +-------------- + +- :func:`graph.single_source_shortest_path_length`: + (not currently used in scikit-learn) + Return the shortest path lengths from a single source + to all connected nodes on a graph. Code is adapted from networkx. + If this is ever needed again, it would be far faster to use a single + iteration of Dijkstra's algorithm from ``graph_shortest_path``. + +- :func:`graph.graph_laplacian`: + (used in :func:`sklearn.cluster.spectral.spectral_embedding`) + Return the Laplacian of a given graph. There is specialized code for + both dense and sparse connectivity matrices. + +- :func:`graph_shortest_path.graph_shortest_path`: + (used in :class:`sklearn.manifold.Isomap`) + Return the shortest path between all pairs of connected points on a directed + or undirected graph. Both the Floyd-Warshall algorithm and Dijkstra's + algorithm are available. The algorithm is most efficient when the + connectivity matrix is a ``scipy.sparse.csr_matrix``. + +Backports +--------- + +- :class:`fixes.Counter` (partial backport of ``collections.Counter`` from + Python 2.7) Used in ``sklearn.feature_extraction.text``. + +- :func:`fixes.unique`: (backport of ``np.unique`` from numpy 1.4). Find the + unique entries in an array. In numpy versions < 1.4, ``np.unique`` is less + flexible. Used in ``sklearn.cross_validation``. + +- :func:`fixes.copysign`: (backport of ``np.copysign`` from numpy 1.4). + Change the sign of ``x1`` to that of ``x2``, element-wise. + +- :func:`fixes.in1d`: (backport of ``np.in1d`` from numpy 1.4). + Test whether each element of an array is in a second array. 
Used in + ``sklearn.datasets.twenty_newsgroups`` and + ``sklearn.feature_extraction.image``. + +- :func:`fixes.savemat` (backport of ``scipy.io.savemat`` from scipy 0.7.2). + Save an array in MATLAB format. In earlier versions, the keyword + ``oned_as`` is not available. + +- :func:`fixes.count_nonzero` (backport of ``np.count_nonzero`` from + numpy 1.6). Count the nonzero elements of a matrix. Used in + tests of ``sklearn.linear_model``. + +- :func:`arrayfuncs.solve_triangular` + (Back-ported from scipy v0.9) Used in ``sklearn.linear_model.omp``, + independent back-ports in ``sklearn.mixture.gmm`` and + ``sklearn.gaussian_process`` + +- :func:`sparsetools.cs_graph_components` + (backported from ``scipy.sparse.cs_graph_components`` in scipy 0.9). + Used in ``sklearn.cluster.hierarchical``, as well as in tests for + ``sklearn.feature_extraction``. + +ARPACK +~~~~~~ + +- :func:`arpack.eigs` + (backported from ``scipy.sparse.linalg.eigs`` in scipy 0.10) + Sparse non-symmetric eigenvalue decomposition using the Arnoldi + method. A limited version of ``eigs`` is available in earlier + scipy versions. + +- :func:`arpack.eigsh` + (backported from ``scipy.sparse.linalg.eigsh`` in scipy 0.10) + Sparse symmetric eigenvalue decomposition using the Lanczos + method. A limited version of ``eigsh`` is available in earlier + scipy versions. + +- :func:`arpack.svds` + (backported from ``scipy.sparse.linalg.svds`` in scipy 0.10) + Sparse truncated singular value decomposition computed via + ARPACK. A limited version of ``svds`` is available in earlier + scipy versions. + +- :func:`fixes.arpack_eigsh` [TODO: remove this from spectral_clustering + in favor of the above back-port] + +Benchmarking +~~~~~~~~~~~~ + +- :func:`bench.total_seconds` (back-ported from ``timedelta.total_seconds`` + in Python 2.7). Used in ``benchmarks/bench_glm.py`` + +Testing Functions +----------------- + +- :func:`testing.assert_in`: Compare string elements within lists. 
+ Used in ``sklearn.datasets`` tests. + +- :class:`mock_urllib2`: Object which mocks the urllib2 module to fake + requests of mldata. Used in tests of ``sklearn.datasets``. + +Helper Functions +---------------- + +- :class:`gen_even_slices`: generator to create ``n``-packs of slices going up + to ``n``. Used in ``sklearn.decomposition.dict_learning`` and + ``sklearn.cluster.k_means``. + +- :class:`arraybuilder.ArrayBuilder`: Helper class to incrementally build + a 1-d numpy.ndarray. Currently used in + ``sklearn.datasets._svmlight_format.pyx`` + +Warnings and Exceptions +----------------------- + +- :class:`deprecated`: Decorator to mark a function or class as deprecated. + +- :class:`ConvergenceWarning`: Custom warning to catch convergence problems. + Used in ``sklearn.covariance.graph_lasso``
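
A note on the ``extmath.logsum`` entry in the new utilities page: the claim that it is robust to overflow can be illustrated with plain NumPy. The sketch below is not the scikit-learn implementation. The helper name ``logsum_sketch`` is hypothetical, it assumes a non-empty 1-d input with a finite maximum, and it uses the usual shift-by-the-maximum trick.

```python
import numpy as np


def logsum_sketch(x):
    """Return log(sum(exp(x))) for a 1-d array of log-domain values.

    Hypothetical helper for illustration only; assumes the input is
    non-empty and that its maximum is finite.
    """
    x = np.asarray(x, dtype=float)
    x_max = x.max()
    # Shifting by the maximum makes the largest exponent exp(0) == 1,
    # so the exponentials cannot overflow.
    return x_max + np.log(np.sum(np.exp(x - x_max)))


x = np.array([1000.0, 1000.0])

# Shifted version stays finite: 1000 + log(2).
stable = logsum_sketch(x)

# Direct formula: exp(1000) overflows to inf in float64.
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))
```

Here ``stable`` is finite while ``naive`` is ``inf``, which is the kind of failure the documented routine is meant to avoid.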