diff --git a/doc/datasets/index.rst b/doc/datasets/index.rst
index 88b1a4b517b36f213f05e2f273a79da83904070b..d864dd1c9b479c782f3f1e7204b2c699cda178da 100644
--- a/doc/datasets/index.rst
+++ b/doc/datasets/index.rst
@@ -38,6 +38,7 @@ require to download any file from some external website.
    :toctree: generated/
    :template: function.rst
 
+   load_boston
    load_iris
    load_diabetes
    load_digits
diff --git a/doc/tutorial.rst b/doc/tutorial.rst
index a9119c4ebc0f2082f51869ce6aa4b1b933b31854..ad6709cb8e80a6fac2216fd9b22232d516aab27f 100644
--- a/doc/tutorial.rst
+++ b/doc/tutorial.rst
@@ -58,9 +58,10 @@ Loading an example dataset
 --------------------------
 
 `scikits.learn` comes with a few standard datasets, for instance the
-`iris dataset <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_, or
-the `digits dataset
-<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_::
+`iris <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_ and `digits
+<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_
+datasets for classification and the `boston house prices dataset
+<http://archive.ics.uci.edu/ml/datasets/Housing>`_ for regression.::
 
     >>> from scikits.learn import datasets
     >>> iris = datasets.load_iris()
diff --git a/scikits/learn/datasets/base.py b/scikits/learn/datasets/base.py
index 9447a4581735e2629c7091a2e29ea5cc26d12146..62c0699974a515f1509556fe4994f8664943583b 100644
--- a/scikits/learn/datasets/base.py
+++ b/scikits/learn/datasets/base.py
@@ -320,7 +320,7 @@ def load_linnerud():
 
 
 def load_boston():
-    """Load the Boston house prices dataset and return it.
+    """Load and return the boston house-prices dataset (regression).
 
     Returns
     -------
diff --git a/scikits/learn/datasets/descr/boston_house_prices.rst b/scikits/learn/datasets/descr/boston_house_prices.rst
index c0c8b29c551980552f0e73d6caa831c3eab96866..804e0e01554216c180ac37d1e6b37a1bb02bc5bc 100644
--- a/scikits/learn/datasets/descr/boston_house_prices.rst
+++ b/scikits/learn/datasets/descr/boston_house_prices.rst
@@ -1,39 +1,53 @@
 Boston House Prices dataset
 
-Source
+Notes
 ------
-    http://lib.stat.cmu.edu/datasets/boston
+Data Set Characteristics:
 
-    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
-    prices and the demand for clean air', J. Environ. Economics & Management,
-    vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
-    ...', Wiley, 1980. N.B. Various transformations are used in the table on
-    pages 244-261 of the latter.
+    :Number of Instances: 506
+    :Number of Attributes: 13 numeric/categorical predictive
+
+    :Median Value (attribute 14) is usually the target
+    :Attribute Information (in order):
+        - CRIM     per capita crime rate by town
+        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
+        - INDUS    proportion of non-retail business acres per town
+        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
+        - NOX      nitric oxides concentration (parts per 10 million)
+        - RM       average number of rooms per dwelling
+        - AGE      proportion of owner-occupied units built prior to 1940
+        - DIS      weighted distances to five Boston employment centres
+        - RAD      index of accessibility to radial highways
+        - TAX      full-value property-tax rate per $10,000
+        - PTRATIO  pupil-teacher ratio by town
+        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
+        - LSTAT    % lower status of the population
+        - MEDV     Median value of owner-occupied homes in $1000's
+
+    :Missing Attribute Values: None
+
+    :Creator: Harrison, D. and Rubinfeld, D.L.
+
+This is a copy of UCI ML housing dataset.
+http://archive.ics.uci.edu/ml/datasets/Housing
+
+
+This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
+
+The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
+prices and the demand for clean air', J. Environ. Economics & Management,
+vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
+...', Wiley, 1980. N.B. Various transformations are used in the table on
+pages 244-261 of the latter.
+
+The Boston house-price data has been used in many machine learning papers that address regression
+problems.
 
-Number of Instances: 452
-
-Number of Attributes: 14 numeric, predictive attributes
-
-Attribute 14 (Median Value) is usually the target
-
-Attribute Information:
-    Variables in order:
-    CRIM     per capita crime rate by town
-    ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
-    INDUS    proportion of non-retail business acres per town
-    CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
-    NOX      nitric oxides concentration (parts per 10 million)
-    RM       average number of rooms per dwelling
-    AGE      proportion of owner-occupied units built prior to 1940
-    DIS      weighted distances to five Boston employment centres
-    RAD      index of accessibility to radial highways
-    TAX      full-value property-tax rate per $10,000
-    PTRATIO  pupil-teacher ratio by town
-    B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
-    LSTAT    % lower status of the population
-    MEDV     Median value of owner-occupied homes in $1000's
-
-Summary Statistics:
-    TODO
+References
+----------
+
+   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
+   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
+   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
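
Note on the attribute list in this patch: the `B` column is a derived quantity, `1000(Bk - 0.63)^2`, not a raw proportion. The snippet below is a minimal sketch of that formula as written in the description, assuming `Bk` is a proportion in `[0, 1]`. The helper name `b_feature` is hypothetical and is not part of `scikits.learn`.

```python
def b_feature(bk: float) -> float:
    """Apply the transform from the dataset description: 1000 * (Bk - 0.63)^2.

    `bk` is assumed to be a town-level proportion between 0 and 1.
    """
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# Evaluate the transform at the two endpoints and at its minimum.
values = {bk: b_feature(bk) for bk in (0.0, 0.63, 1.0)}

# The transform is symmetric around 0.63, so two different proportions
# can map to the same B value; B alone does not determine Bk.
low, high = b_feature(0.53), b_feature(0.73)
```

Because the transform is quadratic, the stored `B` value cannot in general be inverted to a unique `Bk`, which is worth keeping in mind when interpreting this column.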