diff --git a/doc/datasets/index.rst b/doc/datasets/index.rst
index cc258422a421177d257e00378e77b1b43c09b043..7bff294e52048dcae1657a576ea4670b25aebf86 100644
--- a/doc/datasets/index.rst
+++ b/doc/datasets/index.rst
@@ -81,7 +81,7 @@ and pipeline on 2D data.
    load_sample_images
    load_sample_image
 
-.. image:: ../auto_examples/cluster/images/plot_color_quantization_1.png
+.. image:: ../auto_examples/cluster/images/plot_color_quantization_001.png
    :target: ../auto_examples/cluster/plot_color_quantization.html
    :scale: 30
    :align: right
@@ -108,7 +108,7 @@ Sample generators
 In addition, scikit-learn includes various random sample generators that
 can be used to build artificial datasets of controlled size and complexity.
 
-.. image:: ../auto_examples/datasets/images/plot_random_dataset_1.png
+.. image:: ../auto_examples/datasets/images/plot_random_dataset_001.png
    :target: ../auto_examples/datasets/plot_random_dataset.html
    :scale: 50
    :align: center
diff --git a/doc/modules/biclustering.rst b/doc/modules/biclustering.rst
index 1ad5eae4862f8daff2a2ba41a7910c6f0e240a4b..e4583b451e17f8b799d4ef0b6d9ba3e49978b959 100644
--- a/doc/modules/biclustering.rst
+++ b/doc/modules/biclustering.rst
@@ -44,8 +44,8 @@ biclusters on the diagonal. Here is an example of this structure
 where biclusters have higher average values than the other rows and
 columns:
 
-.. figure:: ../auto_examples/bicluster/images/plot_spectral_coclustering_3.png
-   :target: ../auto_examples/bicluster/images/plot_spectral_coclustering_3.png
+.. figure:: ../auto_examples/bicluster/images/plot_spectral_coclustering_003.png
+   :target: ../auto_examples/bicluster/images/plot_spectral_coclustering_003.png
    :align: center
    :scale: 50
 
@@ -56,8 +56,8 @@ each column belongs to all row clusters. Here is an example of this
 structure where the variance of the values within each bicluster is
 small:
 
-.. figure:: ../auto_examples/bicluster/images/plot_spectral_biclustering_3.png
-   :target: ../auto_examples/bicluster/images/plot_spectral_biclustering_3.png
+.. figure:: ../auto_examples/bicluster/images/plot_spectral_biclustering_003.png
+   :target: ../auto_examples/bicluster/images/plot_spectral_biclustering_003.png
    :align: center
    :scale: 50
 
diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst
index 6ed6ac742f90e4ed5bb8b733aebc078946740d7b..4cb91923aafeff52799036fab32fca80b8eff978 100644
--- a/doc/modules/clustering.rst
+++ b/doc/modules/clustering.rst
@@ -33,7 +33,7 @@ data can be found in the ``labels_`` attribute.
 Overview of clustering methods
 ===============================
 
-.. figure:: ../auto_examples/cluster/images/plot_cluster_comparison_1.png
+.. figure:: ../auto_examples/cluster/images/plot_cluster_comparison_001.png
    :target: ../auto_examples/cluster/plot_cluster_comparison.html
    :align: center
    :scale: 50
@@ -161,7 +161,7 @@ and the new centroids are computed and the algorithm repeats these last two
 steps until this value is less than a threshold. In other words, it repeats
 until the centroids do not move significantly.
 
-.. image:: ../auto_examples/cluster/images/plot_kmeans_digits_1.png
+.. image:: ../auto_examples/cluster/images/plot_kmeans_digits_001.png
    :target: ../auto_examples/cluster/plot_kmeans_digits.html
    :align: right
    :scale: 35
@@ -245,7 +245,7 @@ convergence or a predetermined number of iterations is reached.
 of the results is reduced. In practice this difference in quality can be quite
 small, as shown in the example and cited reference.
 
-.. figure:: ../auto_examples/cluster/images/plot_mini_batch_kmeans_1.png
+.. figure:: ../auto_examples/cluster/images/plot_mini_batch_kmeans_001.png
    :target: ../auto_examples/cluster/plot_mini_batch_kmeans.html
    :align: center
    :scale: 100
@@ -283,7 +283,7 @@ values from other pairs. This updating happens iteratively until convergence,
 at which point the final exemplars are chosen, and hence the final clustering
 is given.
 
-.. figure:: ../auto_examples/cluster/images/plot_affinity_propagation_1.png
+.. figure:: ../auto_examples/cluster/images/plot_affinity_propagation_001.png
    :target: ../auto_examples/cluster/plot_affinity_propagation.html
    :align: center
    :scale: 50
@@ -384,7 +384,7 @@ Labelling a new sample is performed by finding the nearest centroid for a
 given sample.
 
 
-.. figure:: ../auto_examples/cluster/images/plot_mean_shift_1.png
+.. figure:: ../auto_examples/cluster/images/plot_mean_shift_001.png
    :target: ../auto_examples/cluster/plot_mean_shift.html
    :align: center
    :scale: 50
@@ -424,11 +424,11 @@ graph vertices are pixels, and edges of the similarity graph are a
 function of the gradient of the image.
 
 
-.. |noisy_img| image:: ../auto_examples/cluster/images/plot_segmentation_toy_1.png
+.. |noisy_img| image:: ../auto_examples/cluster/images/plot_segmentation_toy_001.png
     :target: ../auto_examples/cluster/plot_segmentation_toy.html
     :scale: 50
 
-.. |segmented_img| image:: ../auto_examples/cluster/images/plot_segmentation_toy_2.png
+.. |segmented_img| image:: ../auto_examples/cluster/images/plot_segmentation_toy_002.png
     :target: ../auto_examples/cluster/plot_segmentation_toy.html
     :scale: 50
 
@@ -455,11 +455,11 @@ function of the gradient of the image.
  * :ref:`example_cluster_plot_lena_segmentation.py`: Spectral clustering
    to split the image of lena in regions.
 
-.. |lena_kmeans| image:: ../auto_examples/cluster/images/plot_lena_segmentation_1.png
+.. |lena_kmeans| image:: ../auto_examples/cluster/images/plot_lena_segmentation_001.png
     :target: ../auto_examples/cluster/plot_lena_segmentation.html
     :scale: 65
 
-.. |lena_discretize| image:: ../auto_examples/cluster/images/plot_lena_segmentation_2.png
+.. |lena_discretize| image:: ../auto_examples/cluster/images/plot_lena_segmentation_002.png
     :target: ../auto_examples/cluster/plot_lena_segmentation.html
     :scale: 65
 
@@ -545,15 +545,15 @@ Different linkage type: Ward, complete and average linkage
 :class:`AgglomerativeClustering` supports Ward, average, and complete
 linkage strategies.
 
-.. image:: ../auto_examples/cluster/images/plot_digits_linkage_1.png
+.. image:: ../auto_examples/cluster/images/plot_digits_linkage_001.png
     :target: ../auto_examples/cluster/plot_digits_linkage.html
     :scale: 43
 
-.. image:: ../auto_examples/cluster/images/plot_digits_linkage_2.png
+.. image:: ../auto_examples/cluster/images/plot_digits_linkage_002.png
     :target: ../auto_examples/cluster/plot_digits_linkage.html
     :scale: 43
 
-.. image:: ../auto_examples/cluster/images/plot_digits_linkage_3.png
+.. image:: ../auto_examples/cluster/images/plot_digits_linkage_003.png
     :target: ../auto_examples/cluster/plot_digits_linkage.html
     :scale: 43
 
@@ -582,11 +582,11 @@ constraints forbid the merging of points that are not adjacent on the swiss
 roll, and thus avoid forming clusters that extend across overlapping folds of
 the roll.
 
-.. |unstructured| image:: ../auto_examples/cluster/images/plot_ward_structured_vs_unstructured_1.png
+.. |unstructured| image:: ../auto_examples/cluster/images/plot_ward_structured_vs_unstructured_001.png
         :target: ../auto_examples/cluster/plot_ward_structured_vs_unstructured.html
         :scale: 49
 
-.. |structured| image:: ../auto_examples/cluster/images/plot_ward_structured_vs_unstructured_2.png
+.. |structured| image:: ../auto_examples/cluster/images/plot_ward_structured_vs_unstructured_002.png
         :target: ../auto_examples/cluster/plot_ward_structured_vs_unstructured.html
         :scale: 49
 
@@ -634,19 +634,19 @@ enable only merging of neighboring pixels on an image, as in the
     clusters and almost empty ones. (see the discussion in
     :ref:`example_cluster_plot_agglomerative_clustering.py`).
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_1.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_001.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering.html
     :scale: 38
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_2.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_002.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering.html
     :scale: 38
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_3.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_003.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering.html
     :scale: 38
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_4.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_004.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering.html
     :scale: 38
 
@@ -670,15 +670,15 @@ The guidelines for choosing a metric is to use one that maximizes the
 distance between samples in different classes, and minimizes that within
 each class.
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_metrics_5.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_metrics_005.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering_metrics.html
     :scale: 32
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_metrics_6.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_metrics_006.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering_metrics.html
     :scale: 32
 
-.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_metrics_7.png
+.. image:: ../auto_examples/cluster/images/plot_agglomerative_clustering_metrics_007.png
     :target: ../auto_examples/cluster/plot_agglomerative_clustering_metrics.html
     :scale: 32
 
@@ -728,7 +728,7 @@ indicating core samples found by the algorithm. Smaller circles are non-core
 samples that are still part of a cluster. Moreover, the outliers are indicated
 by black points below.
 
-.. |dbscan_results| image:: ../auto_examples/cluster/images/plot_dbscan_1.png
+.. |dbscan_results| image:: ../auto_examples/cluster/images/plot_dbscan_001.png
         :target: ../auto_examples/cluster/plot_dbscan.html
         :scale: 50
 
@@ -1169,7 +1169,7 @@ Drawbacks
   smaller sample sizes or larger number of clusters it is safer to use
   an adjusted index such as the Adjusted Rand Index (ARI)**.
 
-.. figure:: ../auto_examples/cluster/images/plot_adjusted_for_chance_measures_1.png
+.. figure:: ../auto_examples/cluster/images/plot_adjusted_for_chance_measures_001.png
    :target: ../auto_examples/cluster/plot_adjusted_for_chance_measures.html
    :align: center
    :scale: 100
diff --git a/doc/modules/computational_performance.rst b/doc/modules/computational_performance.rst
index f421c40fac684e6a8727eea55c9f1806acf3e512..cc5a792a47d572cc3a69f4d78eb25d79570a478d 100644
--- a/doc/modules/computational_performance.rst
+++ b/doc/modules/computational_performance.rst
@@ -51,13 +51,13 @@ linear algebra libraries optimizations etc.). Here we see on a setting
 with few features that independently of estimator choice the bulk mode is
 always faster, and for some of them by 1 to 2 orders of magnitude:
 
-.. |atomic_prediction_latency| image::  ../auto_examples/applications/images/plot_prediction_latency_1.png
+.. |atomic_prediction_latency| image::  ../auto_examples/applications/images/plot_prediction_latency_001.png
     :target: ../auto_examples/applications/plot_prediction_latency.html
     :scale: 80
 
 .. centered:: |atomic_prediction_latency|
 
-.. |bulk_prediction_latency| image::  ../auto_examples/applications/images/plot_prediction_latency_2.png
+.. |bulk_prediction_latency| image::  ../auto_examples/applications/images/plot_prediction_latency_002.png
     :target: ../auto_examples/applications/plot_prediction_latency.html
     :scale: 80
 
@@ -79,7 +79,7 @@ From a computing perspective it also means that the number of basic operations
 too. Here is a graph of the evolution of the prediction latency with the
 number of features:
 
-.. |influence_of_n_features_on_latency| image::  ../auto_examples/applications/images/plot_prediction_latency_3.png
+.. |influence_of_n_features_on_latency| image::  ../auto_examples/applications/images/plot_prediction_latency_003.png
     :target: ../auto_examples/applications/plot_prediction_latency.html
     :scale: 80
 
@@ -148,7 +148,7 @@ describe it fully. Of course sparsity influences in turn the prediction time
 as the sparse dot-product takes time roughly proportional to the number of
 non-zero coefficients.
 
-.. |en_model_complexity| image::  ../auto_examples/applications/images/plot_model_complexity_influence_1.png
+.. |en_model_complexity| image::  ../auto_examples/applications/images/plot_model_complexity_influence_001.png
     :target: ../auto_examples/applications/plot_model_complexity_influence.html
     :scale: 80
 
@@ -163,7 +163,7 @@ support vector. In the following graph the ``nu`` parameter of
 :class:`sklearn.svm.classes.NuSVR` was used to influence the number of
 support vectors.
 
-.. |nusvr_model_complexity| image::  ../auto_examples/applications/images/plot_model_complexity_influence_2.png
+.. |nusvr_model_complexity| image::  ../auto_examples/applications/images/plot_model_complexity_influence_002.png
     :target: ../auto_examples/applications/plot_model_complexity_influence.html
     :scale: 80
 
@@ -175,7 +175,7 @@ important role. Latency and throughput should scale linearly with the number
 of trees. In this case we used directly the ``n_estimators`` parameter of
 :class:`sklearn.ensemble.gradient_boosting.GradientBoostingRegressor`.
 
-.. |gbt_model_complexity| image::  ../auto_examples/applications/images/plot_model_complexity_influence_3.png
+.. |gbt_model_complexity| image::  ../auto_examples/applications/images/plot_model_complexity_influence_003.png
     :target: ../auto_examples/applications/plot_model_complexity_influence.html
     :scale: 80
 
@@ -199,7 +199,7 @@ files, tokenizing the text and hashing it into a common vector space) is
 taking 100 to 500 times more time than the actual prediction code, depending on
 the chosen model.
 
- .. |prediction_time| image::  ../auto_examples/applications/images/plot_out_of_core_classification_4.png
+.. |prediction_time| image::  ../auto_examples/applications/images/plot_out_of_core_classification_004.png
     :target: ../auto_examples/applications/plot_out_of_core_classification.html
     :scale: 80
 
@@ -218,7 +218,7 @@ time. Here is a benchmark from the
 :ref:`example_applications_plot_prediction_latency.py` example that measures
 this quantity for a number of estimators on synthetic data:
 
-.. |throughput_benchmark| image::  ../auto_examples/applications/images/plot_prediction_latency_4.png
+.. |throughput_benchmark| image::  ../auto_examples/applications/images/plot_prediction_latency_004.png
     :target: ../auto_examples/applications/plot_prediction_latency.html
     :scale: 80
 
diff --git a/doc/modules/covariance.rst b/doc/modules/covariance.rst
index 9ccb8c964e0b0a48f61874ab76d01e11d5ee7423..9da5be0a4083be27145bbf37ddb2f5bf180aa053 100644
--- a/doc/modules/covariance.rst
+++ b/doc/modules/covariance.rst
@@ -133,7 +133,7 @@ with the :meth:`oas` function of the `sklearn.covariance`
 package, or it can be otherwise obtained by fitting an :class:`OAS`
 object to the same sample.
 
-.. figure:: ../auto_examples/covariance/images/plot_covariance_estimation_1.png
+.. figure:: ../auto_examples/covariance/images/plot_covariance_estimation_001.png
    :target: ../auto_examples/covariance/plot_covariance_estimation.html
    :align: center
    :scale: 65%
@@ -155,7 +155,7 @@ object to the same sample.
      an :class:`OAS` estimator of the covariance.
 
 
-.. figure:: ../auto_examples/covariance/images/plot_lw_vs_oas_1.png
+.. figure:: ../auto_examples/covariance/images/plot_lw_vs_oas_001.png
    :target: ../auto_examples/covariance/plot_lw_vs_oas.html
    :align: center
    :scale: 75%
@@ -187,7 +187,7 @@ the precision matrix: the higher its ``alpha`` parameter, the more sparse
 the precision matrix. The corresponding :class:`GraphLassoCV` object uses
 cross-validation to automatically set the ``alpha`` parameter.
 
-.. figure:: ../auto_examples/covariance/images/plot_sparse_cov_1.png
+.. figure:: ../auto_examples/covariance/images/plot_sparse_cov_001.png
    :target: ../auto_examples/covariance/plot_sparse_cov.html
    :align: center
    :scale: 60%
@@ -308,11 +308,11 @@ attributes of a :class:`MinCovDet` robust covariance estimator object.
      :class:`MinCovDet` covariance estimators in terms of Mahalanobis distance
      (so we get a better estimate of the precision matrix too).
 
-.. |robust_vs_emp| image:: ../auto_examples/covariance/images/plot_robust_vs_empirical_covariance_1.png
+.. |robust_vs_emp| image:: ../auto_examples/covariance/images/plot_robust_vs_empirical_covariance_001.png
    :target: ../auto_examples/covariance/plot_robust_vs_empirical_covariance.html
    :scale: 49%
 
-.. |mahalanobis| image:: ../auto_examples/covariance/images/plot_mahalanobis_distances_1.png
+.. |mahalanobis| image:: ../auto_examples/covariance/images/plot_mahalanobis_distances_001.png
    :target: ../auto_examples/covariance/plot_mahalanobis_distances.html
    :scale: 49%
 
diff --git a/doc/modules/cross_decomposition.rst b/doc/modules/cross_decomposition.rst
index caa3bccdfed4815bb1d4eb4960d54d60c9366ed8..c55a2168458a06108e66940df2551bf64d160f24 100644
--- a/doc/modules/cross_decomposition.rst
+++ b/doc/modules/cross_decomposition.rst
@@ -13,7 +13,7 @@ These families of algorithms are useful to find linear relations between two
 multivariate datasets: the ``X`` and ``Y`` arguments of the ``fit`` method
 are 2D arrays.
 
-.. figure:: ../auto_examples/cross_decomposition/images/plot_compare_cross_decomposition_1.png
+.. figure:: ../auto_examples/cross_decomposition/images/plot_compare_cross_decomposition_001.png
    :target: ../auto_examples/cross_decomposition/plot_compare_cross_decomposition.html
    :scale: 75%
    :align: center
diff --git a/doc/modules/decomposition.rst b/doc/modules/decomposition.rst
index 33732aaa0630dbdbcab19d2803e5bf66892fc1be..74c47b4a2c70862a39f75b5dacbdc8f25b77342a 100644
--- a/doc/modules/decomposition.rst
+++ b/doc/modules/decomposition.rst
@@ -34,7 +34,7 @@ longer exact since some information is lost while forward transforming.
 Below is an example of the iris dataset, which is comprised of 4
 features, projected on the 2 dimensions that explain most variance:
 
-.. figure:: ../auto_examples/decomposition/images/plot_pca_vs_lda_1.png
+.. figure:: ../auto_examples/decomposition/images/plot_pca_vs_lda_001.png
     :target: ../auto_examples/decomposition/plot_pca_vs_lda.html
     :align: center
     :scale: 75%
@@ -45,7 +45,7 @@ probabilistic interpretation of the PCA that can give a likelihood of
 data based on the amount of variance it explains. As such it implements a
 `score` method that can be used in cross-validation:
 
-.. figure:: ../auto_examples/decomposition/images/plot_pca_vs_fa_model_selection_1.png
+.. figure:: ../auto_examples/decomposition/images/plot_pca_vs_fa_model_selection_001.png
     :target: ../auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
     :align: center
     :scale: 75%
@@ -89,11 +89,11 @@ singular vectors reshaped as portraits. Since we only require the top
 and :math:`n_{features} = 64 \times 64 = 4096`, the computation time is
 less than 1s:
 
-.. |orig_img| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_1.png
+.. |orig_img| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_001.png
    :target: ../auto_examples/decomposition/plot_faces_decomposition.html
    :scale: 60%
 
-.. |pca_img| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_2.png
+.. |pca_img| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_002.png
    :target: ../auto_examples/decomposition/plot_faces_decomposition.html
    :scale: 60%
 
@@ -147,7 +147,7 @@ applications including denoising, compression and structured prediction
 (kernel dependency estimation). :class:`KernelPCA` supports both
 ``transform`` and ``inverse_transform``.
 
-.. figure:: ../auto_examples/decomposition/images/plot_kernel_pca_1.png
+.. figure:: ../auto_examples/decomposition/images/plot_kernel_pca_001.png
     :target: ../auto_examples/decomposition/plot_kernel_pca.html
     :align: center
     :scale: 75%
@@ -197,7 +197,7 @@ norms that take into account adjacency and different kinds of structure; see
 For more details on how to use Sparse PCA, see the Examples section, below.
 
 
-.. |spca_img| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_5.png
+.. |spca_img| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_005.png
    :target: ../auto_examples/decomposition/plot_faces_decomposition.html
    :scale: 60%
 
@@ -401,11 +401,11 @@ dictionary fixed, and then updating the dictionary to best fit the sparse code.
                 0 \leq k < n_{atoms}
 
 
-.. |pca_img2| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_2.png
+.. |pca_img2| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_002.png
    :target: ../auto_examples/decomposition/plot_faces_decomposition.html
    :scale: 60%
 
-.. |dict_img2| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_6.png
+.. |dict_img2| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_006.png
    :target: ../auto_examples/decomposition/plot_faces_decomposition.html
    :scale: 60%
 
@@ -420,7 +420,7 @@ The following image shows how a dictionary learned from 4x4 pixel image patches
 extracted from part of the image of Lena looks.
 
 
-.. figure:: ../auto_examples/decomposition/images/plot_image_denoising_1.png
+.. figure:: ../auto_examples/decomposition/images/plot_image_denoising_001.png
     :target: ../auto_examples/decomposition/plot_image_denoising.html
     :align: center
     :scale: 50%
@@ -458,7 +458,7 @@ does not fit into the memory.
 
 .. currentmodule:: sklearn.cluster
 
-.. image:: ../auto_examples/cluster/images/plot_dict_face_patches_1.png
+.. image:: ../auto_examples/cluster/images/plot_dict_face_patches_001.png
     :target: ../auto_examples/cluster/plot_dict_face_patches.html
     :scale: 50%
     :align: right
@@ -533,11 +533,11 @@ Factor analysis *can* produce similar components (the columns of its loading
 matrix) to :class:`PCA`. However, one can not make any general statements
 about these components (e.g. whether they are orthogonal):
 
-.. |pca_img3| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_2.png
+.. |pca_img3| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_002.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :scale: 60%
 
-.. |fa_img3| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_9.png
+.. |fa_img3| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_009.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :scale: 60%
 
@@ -547,7 +547,7 @@ The main advantage of Factor Analysis (over :class:`PCA`) is that
 it can model the variance in every direction of the input space independently
 (heteroscedastic noise):
 
-.. figure:: ../auto_examples/decomposition/images/plot_faces_decomposition_8.png
+.. figure:: ../auto_examples/decomposition/images/plot_faces_decomposition_008.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :align: center
     :scale: 75%
@@ -555,7 +555,7 @@ it can model the variance in every direction of the input space independently
 This allows better model selection than probabilistic PCA in the presence
 of heteroscedastic noise:
 
-.. figure:: ../auto_examples/decomposition/images/plot_pca_vs_fa_model_selection_2.png
+.. figure:: ../auto_examples/decomposition/images/plot_pca_vs_fa_model_selection_002.png
     :target: ../auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
     :align: center
     :scale: 75%
@@ -582,7 +582,7 @@ of the PCA variants.
 It is classically used to separate mixed signals (a problem known as
 *blind source separation*), as in the example below:
 
-.. figure:: ../auto_examples/decomposition/images/plot_ica_blind_source_separation_1.png
+.. figure:: ../auto_examples/decomposition/images/plot_ica_blind_source_separation_001.png
     :target: ../auto_examples/decomposition/plot_ica_blind_source_separation.html
     :align: center
     :scale: 60%
@@ -591,11 +591,11 @@ It is classically used to separate mixed signals (a problem known as
 ICA can also be used as yet another non linear decomposition that finds
 components with some sparsity:
 
-.. |pca_img4| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_2.png
+.. |pca_img4| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_002.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :scale: 60%
 
-.. |ica_img4| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_4.png
+.. |ica_img4| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_004.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :scale: 60%
 
@@ -639,11 +639,11 @@ resulting in interpretable models. The following example displays 16
 sparse components found by :class:`NMF` from the images in the Olivetti
 faces dataset, in comparison with the PCA eigenfaces.
 
-.. |pca_img5| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_2.png
+.. |pca_img5| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_002.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :scale: 60%
 
-.. |nmf_img5| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_3.png
+.. |nmf_img5| image:: ../auto_examples/decomposition/images/plot_faces_decomposition_003.png
     :target: ../auto_examples/decomposition/plot_faces_decomposition.html
     :scale: 60%
 
diff --git a/doc/modules/density.rst b/doc/modules/density.rst
index ff91abad858b534069af345fd72314ba8ab27ffe..c9f5c271f7f15ec2190181671130f2c3779903e9 100644
--- a/doc/modules/density.rst
+++ b/doc/modules/density.rst
@@ -24,7 +24,7 @@ A histogram is a simple visualization of data where bins are defined, and the
 number of data points within each bin is tallied.  An example of a histogram
 can be seen in the upper-left panel of the following figure:
 
-.. |hist_to_kde| image:: ../auto_examples/neighbors/images/plot_kde_1d_1.png
+.. |hist_to_kde| image:: ../auto_examples/neighbors/images/plot_kde_1d_001.png
    :target: ../auto_examples/neighbors/plot_kde_1d.html
    :scale: 80
 
@@ -68,7 +68,7 @@ dimensionality causes its performance to degrade in high dimensions.
 In the following figure, 100 points are drawn from a bimodal distribution,
 and the kernel density estimates are shown for three choices of kernels:
 
-.. |kde_1d_distribution| image:: ../auto_examples/neighbors/images/plot_kde_1d_3.png
+.. |kde_1d_distribution| image:: ../auto_examples/neighbors/images/plot_kde_1d_003.png
    :target: ../auto_examples/neighbors/plot_kde_1d.html
    :scale: 80
 
@@ -103,7 +103,7 @@ to an unsmooth (i.e. high-variance) density distribution.
 :class:`sklearn.neighbors.KernelDensity` implements several common kernel
 forms, which are shown in the following figure:
 
-.. |kde_kernels| image:: ../auto_examples/neighbors/images/plot_kde_1d_2.png
+.. |kde_kernels| image:: ../auto_examples/neighbors/images/plot_kde_1d_002.png
    :target: ../auto_examples/neighbors/plot_kde_1d.html
    :scale: 80
 
@@ -145,7 +145,7 @@ is an example of using a kernel density estimate for a visualization
 of geospatial data, in this case the distribution of observations of two
 different species on the South American continent:
 
-.. |species_kde| image:: ../auto_examples/neighbors/images/plot_species_kde_1.png
+.. |species_kde| image:: ../auto_examples/neighbors/images/plot_species_kde_001.png
    :target: ../auto_examples/neighbors/plot_species_kde.html
    :scale: 80
 
@@ -158,7 +158,7 @@ Here is an example of using this process to
 create a new set of hand-written digits, using a Gaussian kernel learned
 on a PCA projection of the data:
 
-.. |digits_kde| image:: ../auto_examples/neighbors/images/plot_digits_kde_sampling_1.png
+.. |digits_kde| image:: ../auto_examples/neighbors/images/plot_digits_kde_sampling_001.png
    :target: ../auto_examples/neighbors/plot_digits_kde_sampling.html
    :scale: 80
 
diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst
index 9296d0c376b013b3a245b96a3b0ea91af6b92bed..963a1837cc37312466b89eb556d68bb49cfed241 100644
--- a/doc/modules/ensemble.rst
+++ b/doc/modules/ensemble.rst
@@ -181,7 +181,7 @@ in bias::
     >>> scores.mean() > 0.999
     True
 
-.. figure:: ../auto_examples/ensemble/images/plot_forest_iris_1.png
+.. figure:: ../auto_examples/ensemble/images/plot_forest_iris_001.png
     :target: ../auto_examples/ensemble/plot_forest_iris.html
     :align: center
     :scale: 75%
@@ -257,7 +257,7 @@ The following example shows a color-coded representation of the relative
 importances of each individual pixel for a face recognition task using
 an :class:`ExtraTreesClassifier` model.
 
-.. figure:: ../auto_examples/ensemble/images/plot_forest_importances_faces_1.png
+.. figure:: ../auto_examples/ensemble/images/plot_forest_importances_faces_001.png
    :target: ../auto_examples/ensemble/plot_forest_importances_faces.html
    :align: center
    :scale: 75
@@ -333,7 +333,7 @@ ever-increasing influence. Each subsequent weak learner is thereby forced to
 concentrate on the examples that are missed by the previous ones in the sequence
 [HTF]_.
 
-.. figure:: ../auto_examples/ensemble/images/plot_adaboost_hastie_10_2_1.png
+.. figure:: ../auto_examples/ensemble/images/plot_adaboost_hastie_10_2_001.png
    :target: ../auto_examples/ensemble/plot_adaboost_hastie_10_2.html
    :align: center
    :scale: 75
@@ -497,7 +497,7 @@ to determine the optimal number of trees (i.e. ``n_estimators``) by early stoppi
 The plot on the right shows the feature importances which can be obtained via
 the ``feature_importances_`` property.
 
-.. figure:: ../auto_examples/ensemble/images/plot_gradient_boosting_regression_1.png
+.. figure:: ../auto_examples/ensemble/images/plot_gradient_boosting_regression_001.png
    :target: ../auto_examples/ensemble/plot_gradient_boosting_regression.html
    :align: center
    :scale: 75
@@ -694,7 +694,7 @@ outperforms no-shrinkage. Subsampling with shrinkage can further increase
 the accuracy of the model. Subsampling without shrinkage, on the other hand,
 does poorly.
 
-.. figure:: ../auto_examples/ensemble/images/plot_gradient_boosting_regularization_1.png
+.. figure:: ../auto_examples/ensemble/images/plot_gradient_boosting_regularization_001.png
    :target: ../auto_examples/ensemble/plot_gradient_boosting_regularization.html
    :align: center
    :scale: 75
@@ -785,7 +785,7 @@ usually chosen among the most important features.
 The Figure below shows four one-way and one two-way partial dependence plots
 for the California housing dataset:
 
-.. figure:: ../auto_examples/ensemble/images/plot_partial_dependence_1.png
+.. figure:: ../auto_examples/ensemble/images/plot_partial_dependence_001.png
    :target: ../auto_examples/ensemble/plot_partial_dependence.html
    :align: center
    :scale: 70
diff --git a/doc/modules/feature_extraction.rst b/doc/modules/feature_extraction.rst
index 4043338990356b27d1a30906bfd7b811cf9c9449..b757883fb015e18153259e55beb0195405f2aad3 100644
--- a/doc/modules/feature_extraction.rst
+++ b/doc/modules/feature_extraction.rst
@@ -890,7 +890,7 @@ features or samples. For instance Ward clustering
 (:ref:`hierarchical_clustering`) can cluster together only neighboring pixels
 of an image, thus forming contiguous patches:
 
-.. figure:: ../auto_examples/cluster/images/plot_lena_ward_segmentation_1.png
+.. figure:: ../auto_examples/cluster/images/plot_lena_ward_segmentation_001.png
    :target: ../auto_examples/cluster/plot_lena_ward_segmentation.html
    :align: center
    :scale: 40
diff --git a/doc/modules/feature_selection.rst b/doc/modules/feature_selection.rst
index fb6aedcc6b44f72a43f181d8ff81aed0ae019ddd..0fc8a5abf0532f15948250b7964c30fb535e1b8b 100644
--- a/doc/modules/feature_selection.rst
+++ b/doc/modules/feature_selection.rst
@@ -210,7 +210,7 @@ settings, using the Lasso, while :class:`RandomizedLogisticRegression` uses the
 logistic regression and is suitable for classification tasks.  To get a full
 path of stability scores you can use :func:`lasso_stability_path`.
 
-.. figure:: ../auto_examples/linear_model/images/plot_sparse_recovery_2.png
+.. figure:: ../auto_examples/linear_model/images/plot_sparse_recovery_002.png
    :target: ../auto_examples/linear_model/plot_sparse_recovery.html
    :align: center
    :scale: 60
diff --git a/doc/modules/gaussian_process.rst b/doc/modules/gaussian_process.rst
index aad5aad45699117ebf6f95350a7d48ab99fe6c65..a272dd177fa1eca27e6b51e42bc7a74fb8fe3760 100644
--- a/doc/modules/gaussian_process.rst
+++ b/doc/modules/gaussian_process.rst
@@ -59,7 +59,7 @@ data. Depending on the number of parameters provided at instantiation, the
 fitting procedure may resort to maximum likelihood estimation for the
 parameters or alternatively it uses the given parameters.
 
-.. figure:: ../auto_examples/gaussian_process/images/plot_gp_regression_1.png
+.. figure:: ../auto_examples/gaussian_process/images/plot_gp_regression_001.png
    :target: ../auto_examples/gaussian_process/plot_gp_regression.html
    :align: center
 
@@ -100,7 +100,7 @@ equivalent to specifying a fractional variance in the input.  That is
 With ``nugget`` and ``corr`` properly set, Gaussian Processes can be
 used to robustly recover an underlying function from noisy data:
 
-.. figure:: ../auto_examples/gaussian_process/images/plot_gp_regression_2.png
+.. figure:: ../auto_examples/gaussian_process/images/plot_gp_regression_002.png
    :target: ../auto_examples/gaussian_process/plot_gp_regression.html
    :align: center
 
diff --git a/doc/modules/isotonic.rst b/doc/modules/isotonic.rst
index c781beaa186dde7bcb0d56f1e73889ce0ddbb2f8..9da18e4f069a20b1b46aa38342de2462e0a3db14 100644
--- a/doc/modules/isotonic.rst
+++ b/doc/modules/isotonic.rst
@@ -18,6 +18,6 @@ arbitrary real number. It yields the vector which is composed of non-decreasing
 elements the closest in terms of mean squared error. In practice this list
 of elements forms a function that is piecewise linear.
 
-.. figure:: ../auto_examples/images/plot_isotonic_regression_1.png
+.. figure:: ../auto_examples/images/plot_isotonic_regression_001.png
    :target: ../auto_examples/plot_isotonic_regression.html
    :align: center
diff --git a/doc/modules/kernel_approximation.rst b/doc/modules/kernel_approximation.rst
index a6ce6f44ab6026f3a9969c0f51b8cede98e45dcc..148d09df58894f93e97354bd8ddc4a3086e6c77f 100644
--- a/doc/modules/kernel_approximation.rst
+++ b/doc/modules/kernel_approximation.rst
@@ -83,7 +83,7 @@ For a given value of ``n_components`` :class:`RBFSampler` is often less accurate
 than :class:`Nystroem`. :class:`RBFSampler` is cheaper to compute, though, making
 use of larger feature spaces more efficient.
 
-.. figure:: ../auto_examples/images/plot_kernel_approximation_2.png
+.. figure:: ../auto_examples/images/plot_kernel_approximation_002.png
     :target: ../auto_examples/plot_kernel_approximation.html
     :scale: 50%
     :align: center
diff --git a/doc/modules/label_propagation.rst b/doc/modules/label_propagation.rst
index 38f94c61f6d9f83572a204865757f10c2d1a51ec..80f865f01c4d4c5928d261287dd6ce69bf788c09 100644
--- a/doc/modules/label_propagation.rst
+++ b/doc/modules/label_propagation.rst
@@ -37,7 +37,7 @@ A few features available in this model:
 :class:`LabelPropagation` and :class:`LabelSpreading`. Both work by
 constructing a similarity graph over all items in the input dataset. 
 
-.. figure:: ../auto_examples/semi_supervised/images/plot_label_propagation_structure_1.png
+.. figure:: ../auto_examples/semi_supervised/images/plot_label_propagation_structure_001.png
     :target: ../auto_examples/semi_supervised/plot_label_propagation_structure.html
     :align: center
     :scale: 60%
diff --git a/doc/modules/lda_qda.rst b/doc/modules/lda_qda.rst
index 89b29c206bc240fc64ca28ae08acfc5adf39227b..2706cbb405b3b31077003d4d9b36e222ea97e663 100644
--- a/doc/modules/lda_qda.rst
+++ b/doc/modules/lda_qda.rst
@@ -16,7 +16,7 @@ can be easily computed, are inherently multiclass,
 and have proven to work well in practice.
 Also there are no parameters to tune for these algorithms.
 
-.. |ldaqda| image:: ../auto_examples/images/plot_lda_qda_1.png
+.. |ldaqda| image:: ../auto_examples/images/plot_lda_qda_001.png
         :target: ../auto_examples/plot_lda_qda.html
         :scale: 80
 
diff --git a/doc/modules/learning_curve.rst b/doc/modules/learning_curve.rst
index 176813ec373ca601a132ff491517c1ee66b1cef4..19b97945468f05b908613af1545f2e190f79e183 100644
--- a/doc/modules/learning_curve.rst
+++ b/doc/modules/learning_curve.rst
@@ -21,7 +21,7 @@ the second estimator approximates it almost perfectly and the last estimator
 approximates the training data perfectly but does not fit the true function
 very well, i.e. it is very sensitive to varying training data (high variance).
 
-.. figure:: ../auto_examples/images/plot_underfitting_overfitting_1.png
+.. figure:: ../auto_examples/images/plot_underfitting_overfitting_001.png
    :target: ../auto_examples/plot_underfitting_overfitting.html
    :align: center
    :scale: 50%
@@ -98,7 +98,7 @@ training score and a high validation score is usually not possible. All three
 cases can be found in the plot below where we vary the parameter
 :math:`\gamma` of an SVM on the digits dataset.
 
-.. figure:: ../auto_examples/images/plot_validation_curve_1.png
+.. figure:: ../auto_examples/images/plot_validation_curve_001.png
    :target: ../auto_examples/plot_validation_curve.html
    :align: center
    :scale: 50%
@@ -118,7 +118,7 @@ size of the training set, we will not benefit much from more training data.
 In the following plot you can see an example: naive Bayes roughly converges
 to a low score.
 
-.. figure:: ../auto_examples/images/plot_learning_curve_1.png
+.. figure:: ../auto_examples/images/plot_learning_curve_001.png
    :target: ../auto_examples/plot_learning_curve.html
    :align: center
    :scale: 50%
@@ -130,7 +130,7 @@ the maximum number of training samples, adding more training samples will
 most likely increase generalization. In the following plot you can see that
 the SVM could benefit from more training examples.
 
-.. figure:: ../auto_examples/images/plot_learning_curve_2.png
+.. figure:: ../auto_examples/images/plot_learning_curve_002.png
    :target: ../auto_examples/plot_learning_curve.html
    :align: center
    :scale: 50%
diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index c6dc3584f7e35e046d32c88c7f21fbe28814acaa..0ad9dc78cc3a2f5b395ddba6d28ba54972967cc5 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -33,7 +33,7 @@ solves a problem of the form:
 
 .. math:: \underset{w}{min\,} {|| X w - y||_2}^2
 
-.. figure:: ../auto_examples/linear_model/images/plot_ols_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_ols_001.png
    :target: ../auto_examples/linear_model/plot_ols.html
    :align: center
    :scale: 50%
@@ -90,7 +90,7 @@ Here, :math:`\alpha \geq 0` is a complexity parameter that controls the amount
 of shrinkage: the larger the value of :math:`\alpha`, the greater the amount
 of shrinkage and thus the coefficients become more robust to collinearity.
 
-.. figure:: ../auto_examples/linear_model/images/plot_ridge_path_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_ridge_path_001.png
    :target: ../auto_examples/linear_model/plot_ridge_path.html
    :align: center
    :scale: 50%
@@ -230,11 +230,11 @@ the advantage of exploring more relevant values of `alpha` parameter, and
 if the number of samples is very small compared to the number of
 features, it is often faster than :class:`LassoCV`.
 
-.. |lasso_cv_1| image:: ../auto_examples/linear_model/images/plot_lasso_model_selection_2.png
+.. |lasso_cv_1| image:: ../auto_examples/linear_model/images/plot_lasso_model_selection_002.png
     :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
     :scale: 48%
 
-.. |lasso_cv_2| image:: ../auto_examples/linear_model/images/plot_lasso_model_selection_3.png
+.. |lasso_cv_2| image:: ../auto_examples/linear_model/images/plot_lasso_model_selection_003.png
     :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
     :scale: 48%
 
@@ -255,7 +255,7 @@ is correct, i.e. that the data are actually generated by this model.
 They also tend to break when the problem is badly conditioned
 (more features than samples).
 
-.. figure:: ../auto_examples/linear_model/images/plot_lasso_model_selection_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_lasso_model_selection_001.png
     :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
     :align: center
     :scale: 50%
@@ -289,7 +289,7 @@ The objective function to minimize is in this case
     \frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}
 
 
-.. figure:: ../auto_examples/linear_model/images/plot_lasso_coordinate_descent_path_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_lasso_coordinate_descent_path_001.png
    :target: ../auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
    :align: center
    :scale: 50%
@@ -318,11 +318,11 @@ with a simple Lasso or a MultiTaskLasso. The Lasso estimates yields
 scattered non-zeros while the non-zeros of the MultiTaskLasso are full
 columns.
 
-.. |multi_task_lasso_1| image:: ../auto_examples/linear_model/images/plot_multi_task_lasso_support_1.png
+.. |multi_task_lasso_1| image:: ../auto_examples/linear_model/images/plot_multi_task_lasso_support_001.png
     :target: ../auto_examples/linear_model/plot_multi_task_lasso_support.html
     :scale: 48%
 
-.. |multi_task_lasso_2| image:: ../auto_examples/linear_model/images/plot_multi_task_lasso_support_2.png
+.. |multi_task_lasso_2| image:: ../auto_examples/linear_model/images/plot_multi_task_lasso_support_002.png
     :target: ../auto_examples/linear_model/plot_multi_task_lasso_support.html
     :scale: 48%
 
@@ -399,7 +399,7 @@ algorithm, and unlike the implementation based on coordinate_descent,
 this yields the exact solution, which is piecewise linear as a
 function of the norm of its coefficients.
 
-.. figure:: ../auto_examples/linear_model/images/plot_lasso_lars_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_lasso_lars_001.png
    :target: ../auto_examples/linear_model/plot_lasso_lars.html
    :align: center
    :scale: 50%
@@ -556,7 +556,7 @@ log likelihood*.
 By default :math:`\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}`.
 
 
-.. figure:: ../auto_examples/linear_model/images/plot_bayesian_ridge_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_bayesian_ridge_001.png
    :target: ../auto_examples/linear_model/plot_bayesian_ridge.html
    :align: center
    :scale: 50%
@@ -623,7 +623,7 @@ has its own standard deviation :math:`\lambda_i`. The prior over all
 :math:`\lambda_i` is chosen to be the same gamma distribution given by
 hyperparameters :math:`\lambda_1` and :math:`\lambda_2`.
 
-.. figure:: ../auto_examples/linear_model/images/plot_ard_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_ard_001.png
    :target: ../auto_examples/linear_model/plot_ard.html
    :align: center
    :scale: 50%
@@ -752,7 +752,7 @@ which may be subject to noise, and outliers, which are e.g. caused by erroneous
 measurements or invalid hypotheses about the data. The resulting model is then
 estimated only from the determined inliers.
 
-.. figure:: ../auto_examples/linear_model/images/plot_ransac_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_ransac_001.png
    :target: ../auto_examples/linear_model/plot_ransac.html
    :align: center
    :scale: 50%
@@ -841,7 +841,7 @@ flexibility to fit a much broader range of data.
 Here is an example of applying this idea to one-dimensional data, using
 polynomial features of varying degrees:
 
-.. figure:: ../auto_examples/linear_model/images/plot_polynomial_interpolation_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_polynomial_interpolation_001.png
    :target: ../auto_examples/linear_model/plot_polynomial_interpolation.html
    :align: center
    :scale: 50%
diff --git a/doc/modules/manifold.rst b/doc/modules/manifold.rst
index 32921f0d0e408b13c96967eb2c004c2edc15ed0e..20c4328baede6cab6f80ef5e93b8cd31ed3a5c67 100644
--- a/doc/modules/manifold.rst
+++ b/doc/modules/manifold.rst
@@ -20,7 +20,7 @@ Manifold learning
 
 
 
-.. figure:: ../auto_examples/manifold/images/plot_compare_methods_1.png
+.. figure:: ../auto_examples/manifold/images/plot_compare_methods_001.png
    :target: ../auto_examples/manifold/plot_compare_methods.html
    :align: center
    :scale: 60
@@ -46,11 +46,11 @@ to be desired.  In a random projection, it is likely that the more
 interesting structure within the data will be lost.
 
 
-.. |digits_img| image:: ../auto_examples/manifold/images/plot_lle_digits_1.png
+.. |digits_img| image:: ../auto_examples/manifold/images/plot_lle_digits_001.png
     :target: ../auto_examples/manifold/plot_lle_digits.html
     :scale: 50
 
-.. |projected_img| image::  ../auto_examples/manifold/images/plot_lle_digits_2.png
+.. |projected_img| image::  ../auto_examples/manifold/images/plot_lle_digits_002.png
     :target: ../auto_examples/manifold/plot_lle_digits.html
     :scale: 50
 
@@ -66,11 +66,11 @@ These methods can be powerful, but often miss important non-linear
 structure in the data.
 
 
-.. |PCA_img| image:: ../auto_examples/manifold/images/plot_lle_digits_3.png
+.. |PCA_img| image:: ../auto_examples/manifold/images/plot_lle_digits_003.png
     :target: ../auto_examples/manifold/plot_lle_digits.html
     :scale: 50
 
-.. |LDA_img| image::  ../auto_examples/manifold/images/plot_lle_digits_4.png
+.. |LDA_img| image::  ../auto_examples/manifold/images/plot_lle_digits_004.png
     :target: ../auto_examples/manifold/plot_lle_digits.html
     :scale: 50
 
@@ -106,7 +106,7 @@ Isomap seeks a lower-dimensional embedding which maintains geodesic
 distances between all points.  Isomap can be performed with the object
 :class:`Isomap`.
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_5.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_005.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
@@ -162,7 +162,7 @@ Locally linear embedding can be performed with function
 :func:`locally_linear_embedding` or its object-oriented counterpart
 :class:`LocallyLinearEmbedding`.
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_6.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_006.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
@@ -216,7 +216,7 @@ linear embedding* (MLLE).  MLLE can be  performed with function
 :class:`LocallyLinearEmbedding`, with the keyword ``method = 'modified'``.
 It requires ``n_neighbors > n_components``.
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_7.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_007.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
@@ -266,7 +266,7 @@ for small output dimension.  HLLE can be  performed with function
 :class:`LocallyLinearEmbedding`, with the keyword ``method = 'hessian'``.
 It requires ``n_neighbors > n_components * (n_components + 3) / 2``.
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_8.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_008.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
@@ -358,7 +358,7 @@ tangent spaces to learn the embedding.  LTSA can be performed with function
 :func:`locally_linear_embedding` or its object-oriented counterpart
 :class:`LocallyLinearEmbedding`, with the keyword ``method = 'ltsa'``.
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_9.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_009.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
@@ -416,7 +416,7 @@ vision, the algorithms will try to preserve the order of the distances, and
 hence seek a monotonic relationship between the distances in the embedded
 space and the similarities/dissimilarities.
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_10.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_010.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
@@ -451,7 +451,7 @@ A trivial solution to this problem is to set all the points on the origin. In
 order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized.
 
 
-.. figure:: ../auto_examples/manifold/images/plot_mds_1.png
+.. figure:: ../auto_examples/manifold/images/plot_mds_001.png
    :target: ../auto_examples/manifold/plot_mds.html
    :align: center
    :scale: 60
@@ -487,7 +487,7 @@ of the KL divergence. Hence, it is sometimes useful to try different seeds
 and select the embedding with the lowest KL divergence.
 
 
-.. figure:: ../auto_examples/manifold/images/plot_lle_digits_13.png
+.. figure:: ../auto_examples/manifold/images/plot_lle_digits_013.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50
diff --git a/doc/modules/mixture.rst b/doc/modules/mixture.rst
index bb6de877f164b4cbdba702a6bbc68761f06a2311..14ed5a63a56b38515df4b3f5ef5fda2c5c4922ad 100644
--- a/doc/modules/mixture.rst
+++ b/doc/modules/mixture.rst
@@ -14,7 +14,7 @@ matrices supported), sample them, and estimate them from
 data. Facilities to help determine the appropriate number of
 components are also provided.
 
- .. figure:: ../auto_examples/mixture/images/plot_gmm_pdf_1.png
+.. figure:: ../auto_examples/mixture/images/plot_gmm_pdf_001.png
    :target: ../auto_examples/mixture/plot_gmm_pdf.html
    :align: center
    :scale: 50%
@@ -55,7 +55,7 @@ The :class:`GMM` comes with different options to constrain the covariance
 of the different classes estimated: spherical, diagonal, tied or full
 covariance.
 
-.. figure:: ../auto_examples/mixture/images/plot_gmm_classifier_1.png
+.. figure:: ../auto_examples/mixture/images/plot_gmm_classifier_001.png
    :target: ../auto_examples/mixture/plot_gmm_classifier.html
    :align: center
    :scale: 75%
@@ -102,7 +102,7 @@ only in the asymptotic regime (i.e. if much data is available).
 Note that using a :ref:`DPGMM <dpgmm>` avoids the specification of the
 number of components for a Gaussian mixture model.
 
-.. figure:: ../auto_examples/mixture/images/plot_gmm_selection_1.png
+.. figure:: ../auto_examples/mixture/images/plot_gmm_selection_001.png
    :target: ../auto_examples/mixture/plot_gmm_selection.html
    :align: center
    :scale: 50%
@@ -210,11 +210,11 @@ components, and at the expense of extra computational time the user
 only needs to specify a loose upper bound on this number and a
 concentration parameter.
 
-.. |plot_gmm| image:: ../auto_examples/mixture/images/plot_gmm_1.png
+.. |plot_gmm| image:: ../auto_examples/mixture/images/plot_gmm_001.png
    :target: ../auto_examples/mixture/plot_gmm.html
    :scale: 48%
 
-.. |plot_gmm_sin| image:: ../auto_examples/mixture/images/plot_gmm_sin_1.png
+.. |plot_gmm_sin| image:: ../auto_examples/mixture/images/plot_gmm_sin_001.png
    :target: ../auto_examples/mixture/plot_gmm_sin.html
    :scale: 48%
 
diff --git a/doc/modules/model_evaluation.rst b/doc/modules/model_evaluation.rst
index f3fd09b65ae0551c8f2e27602c5d0d252e7bafe6..97d7682b57f277941a2d40c0d7edc29a60dedc06 100644
--- a/doc/modules/model_evaluation.rst
+++ b/doc/modules/model_evaluation.rst
@@ -292,7 +292,7 @@ predicted to be in group :math:`j`. Here an example of such confusion matrix::
 Here is a visual representation of such a confusion matrix (this figure comes
 from the :ref:`example_plot_confusion_matrix.py` example):
 
-.. image:: ../auto_examples/images/plot_confusion_matrix_1.png
+.. image:: ../auto_examples/images/plot_confusion_matrix_001.png
    :target: ../auto_examples/plot_confusion_matrix.html
    :scale: 75
    :align: center
@@ -794,7 +794,7 @@ Here a small example of how to use the :func:`roc_curve` function::
 
 The following figure shows an example of such an ROC curve.
 
-.. image:: ../auto_examples/images/plot_roc_1.png
+.. image:: ../auto_examples/images/plot_roc_001.png
    :target: ../auto_examples/plot_roc.html
    :scale: 75
    :align: center
@@ -835,7 +835,7 @@ F1 score, ROC AUC doesn't require to optimize a threshold for each label. The
 if predicted outputs have been binarized.
 
 
-.. image:: ../auto_examples/images/plot_roc_2.png
+.. image:: ../auto_examples/images/plot_roc_002.png
    :target: ../auto_examples/plot_roc.html
    :scale: 75
    :align: center
diff --git a/doc/modules/multiclass.rst b/doc/modules/multiclass.rst
index a28652879cba9c99403061d77ea091e13b730b2b..2852dd6b763d7a4da79375fb1443bb02c7e0f0e7 100644
--- a/doc/modules/multiclass.rst
+++ b/doc/modules/multiclass.rst
@@ -142,7 +142,7 @@ To use this feature, feed the classifier an indicator matrix, in which cell
 [i, j] indicates the presence of label j in sample i.
 
 
-.. figure:: ../auto_examples/images/plot_multilabel_1.png
+.. figure:: ../auto_examples/images/plot_multilabel_001.png
     :target: ../auto_examples/plot_multilabel.html
     :align: center
     :scale: 75%
diff --git a/doc/modules/neighbors.rst b/doc/modules/neighbors.rst
index f98711f40936c418b1191342dfd22d167bfeb425..5b8d4ba2e2c037e9daf2dec0a2366c102ed7d528 100644
--- a/doc/modules/neighbors.rst
+++ b/doc/modules/neighbors.rst
@@ -184,11 +184,11 @@ distance can be supplied which is used to compute the weights.
 
 
 
-.. |classification_1| image:: ../auto_examples/neighbors/images/plot_classification_1.png
+.. |classification_1| image:: ../auto_examples/neighbors/images/plot_classification_001.png
    :target: ../auto_examples/neighbors/plot_classification.html
    :scale: 50
 
-.. |classification_2| image:: ../auto_examples/neighbors/images/plot_classification_2.png
+.. |classification_2| image:: ../auto_examples/neighbors/images/plot_classification_002.png
    :target: ../auto_examples/neighbors/plot_classification.html
    :scale: 50
 
@@ -227,7 +227,7 @@ weights proportional to the inverse of the distance from the query point.
 Alternatively, a user-defined function of the distance can be supplied,
 which will be used to compute the weights.
 
-.. figure:: ../auto_examples/neighbors/images/plot_regression_1.png
+.. figure:: ../auto_examples/neighbors/images/plot_regression_001.png
    :target: ../auto_examples/neighbors/plot_regression.html
    :align: center
    :scale: 75
@@ -237,7 +237,7 @@ The use of multi-output nearest neighbors for regression is demonstrated in
 X are the pixels of the upper half of faces and the outputs Y are the pixels of
 the lower half of those faces.
 
-.. figure:: ../auto_examples/images/plot_multioutput_face_completion_1.png
+.. figure:: ../auto_examples/images/plot_multioutput_face_completion_001.png
    :target: ../auto_examples/plot_multioutput_face_completion.html
    :scale: 75
    :align: center
@@ -496,11 +496,11 @@ This is useful, for example, for removing noisy features.
 In the example below, using a small shrink threshold increases the accuracy of
 the model from 0.81 to 0.82.
 
-.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_1.png
+.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_001.png
    :target: ../auto_examples/neighbors/plot_nearest_centroid.html
    :scale: 50
 
-.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_2.png
+.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_002.png
    :target: ../auto_examples/neighbors/plot_nearest_centroid.html
    :scale: 50
 
diff --git a/doc/modules/neural_networks.rst b/doc/modules/neural_networks.rst
index 7c1cc36bd29be689c20f3c3a0c6c1121415a97b5..7519ba01a15dd9458770756cde382995e6906387 100644
--- a/doc/modules/neural_networks.rst
+++ b/doc/modules/neural_networks.rst
@@ -32,7 +32,7 @@ density estimation.
 The method gained popularity for initializing deep neural networks with the
 weights of independent RBMs. This method is known as unsupervised pre-training.
 
-.. figure:: ../auto_examples/images/plot_rbm_logistic_classification_1.png
+.. figure:: ../auto_examples/images/plot_rbm_logistic_classification_001.png
    :target: ../auto_examples/plot_rbm_logistic_classification.html
    :align: center
    :scale: 100%
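A minimal sketch of this pre-training idea as it is typically wired up in scikit-learn, with an RBM feeding its learned features into a logistic regression (random data stands in for real images)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    rng = np.random.RandomState(0)
    X = rng.rand(200, 64)                  # features must lie in [0, 1] for BernoulliRBM
    y = rng.randint(0, 2, 200)

    model = Pipeline([('rbm', BernoulliRBM(n_components=32, random_state=0)),
                      ('logistic', LogisticRegression())])
    model.fit(X, y)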
diff --git a/doc/modules/outlier_detection.rst b/doc/modules/outlier_detection.rst
index 08b1616f819db2d98eafc86c76ce1b935ca9653f..ee7c483c73a7edc28797f9d9ffc6cacc6d73f907 100644
--- a/doc/modules/outlier_detection.rst
+++ b/doc/modules/outlier_detection.rst
@@ -69,7 +69,7 @@ but regular, observation outside the frontier.
      frontier learned around some data by a
      :class:`svm.OneClassSVM` object.
 
-.. figure:: ../auto_examples/svm/images/plot_oneclass_1.png
+.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
    :target: ../auto_examples/svm/plot_oneclass.html
    :align: center
    :scale: 75%
@@ -105,7 +105,7 @@ without being influenced by outliers). The Mahalanobis distances
 obtained from this estimate are used to derive a measure of outlyingness.
 This strategy is illustrated below.
 
-.. figure:: ../auto_examples/covariance/images/plot_mahalanobis_distances_1.png
+.. figure:: ../auto_examples/covariance/images/plot_mahalanobis_distances_001.png
    :target: ../auto_examples/covariance/plot_mahalanobis_distances.html
    :align: center
    :scale: 75%
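A minimal sketch of that strategy with synthetic data (the contamination pattern is invented for illustration)::

    import numpy as np
    from sklearn.covariance import MinCovDet

    rng = np.random.RandomState(0)
    X = rng.randn(100, 2)
    X[:5] += 5                                # a handful of outlying observations

    robust_cov = MinCovDet().fit(X)
    dist = robust_cov.mahalanobis(X)          # squared Mahalanobis distance per sample
    print(dist[:5].mean(), dist[5:].mean())   # outliers receive much larger distances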
@@ -138,15 +138,15 @@ The examples below illustrate how the performance of the
 less unimodal.  :class:`svm.OneClassSVM` works better on data with
 multiple modes.
 
-.. |outlier1| image:: ../auto_examples/covariance/images/plot_outlier_detection_1.png
+.. |outlier1| image:: ../auto_examples/covariance/images/plot_outlier_detection_001.png
    :target: ../auto_examples/covariance/plot_outlier_detection.html
    :scale: 50%
 
-.. |outlier2| image:: ../auto_examples/covariance/images/plot_outlier_detection_2.png
+.. |outlier2| image:: ../auto_examples/covariance/images/plot_outlier_detection_002.png
    :target: ../auto_examples/covariance/plot_outlier_detection.html
    :scale: 50%
 
-.. |outlier3| image:: ../auto_examples/covariance/images/plot_outlier_detection_3.png
+.. |outlier3| image:: ../auto_examples/covariance/images/plot_outlier_detection_003.png
    :target: ../auto_examples/covariance/plot_outlier_detection.html
    :scale: 50%
 
diff --git a/doc/modules/random_projection.rst b/doc/modules/random_projection.rst
index 51d874650ff2fa17d4a23968eb6210cfd2ad4fac..e6ef3cb63e02a035886f048fe8bb7537bf3d633f 100644
--- a/doc/modules/random_projection.rst
+++ b/doc/modules/random_projection.rst
@@ -64,12 +64,12 @@ bounded distortion introduced by the random projection::
   >>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
   array([ 7894,  9868, 11841])
 
-.. figure:: ../auto_examples/images/plot_johnson_lindenstrauss_bound_1.png
+.. figure:: ../auto_examples/images/plot_johnson_lindenstrauss_bound_001.png
    :target: ../auto_examples/plot_johnson_lindenstrauss_bound.html
    :scale: 75
    :align: center
 
-.. figure:: ../auto_examples/images/plot_johnson_lindenstrauss_bound_2.png
+.. figure:: ../auto_examples/images/plot_johnson_lindenstrauss_bound_002.png
    :target: ../auto_examples/plot_johnson_lindenstrauss_bound.html
    :scale: 75
    :align: center
diff --git a/doc/modules/scaling_strategies.rst b/doc/modules/scaling_strategies.rst
index 0131650f47e3037632041e5a370db93cc13f1314..e4c5e9953ca756cb14d49fd53da5965594aecb98 100644
--- a/doc/modules/scaling_strategies.rst
+++ b/doc/modules/scaling_strategies.rst
@@ -95,7 +95,7 @@ systems and demonstrates most of the notions discussed above.
 Furthermore, it also shows the evolution of the performance of different
 algorithms with the number of processed examples.
 
-.. |accuracy_over_time| image::  ../auto_examples/applications/images/plot_out_of_core_classification_1.png
+.. |accuracy_over_time| image::  ../auto_examples/applications/images/plot_out_of_core_classification_001.png
     :target: ../auto_examples/applications/plot_out_of_core_classification.html
     :scale: 80
 
@@ -107,7 +107,7 @@ algorithms, ``MultinomialNB`` is the most expensive, but its overhead can be
 mitigated by increasing the size of the mini-batches (exercise: change 
 ``minibatch_size`` to 100 and 10000 in the program and compare).
 
-.. |computation_time| image::  ../auto_examples/applications/images/plot_out_of_core_classification_3.png
+.. |computation_time| image::  ../auto_examples/applications/images/plot_out_of_core_classification_003.png
     :target: ../auto_examples/applications/plot_out_of_core_classification.html
     :scale: 80
 
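A minimal sketch of the mini-batch pattern underlying that program (the tiny in-memory corpus below merely stands in for a real document stream)::

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    def iter_minibatches(docs, labels, minibatch_size):
        """Yield (documents, labels) chunks of the requested size."""
        for i in range(0, len(docs), minibatch_size):
            yield docs[i:i + minibatch_size], labels[i:i + minibatch_size]

    docs = ["good movie", "bad movie", "great film", "awful film"] * 50
    labels = [1, 0, 1, 0] * 50
    all_classes = [0, 1]

    vectorizer = HashingVectorizer()    # stateless, so it never needs the full corpus
    clf = SGDClassifier()
    for text_batch, y_batch in iter_minibatches(docs, labels, minibatch_size=20):
        X_batch = vectorizer.transform(text_batch)
        clf.partial_fit(X_batch, y_batch, classes=all_classes)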
diff --git a/doc/modules/sgd.rst b/doc/modules/sgd.rst
index 4a09081399ac85859fe24a76b779dcf13e56500a..7f8fb758cc7e01fccc23ffe68b50d80f3da5a2dc 100644
--- a/doc/modules/sgd.rst
+++ b/doc/modules/sgd.rst
@@ -46,7 +46,7 @@ The class :class:`SGDClassifier` implements a plain stochastic gradient
 descent learning routine which supports different loss functions and
 penalties for classification.
 
-.. figure:: ../auto_examples/linear_model/images/plot_sgd_separating_hyperplane_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_sgd_separating_hyperplane_001.png
    :target: ../auto_examples/linear_model/plot_sgd_separating_hyperplane.html
    :align: center
    :scale: 75
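A minimal fit/predict sketch showing how the loss and penalty are selected (toy data)::

    from sklearn.linear_model import SGDClassifier

    X = [[0., 0.], [1., 1.]]
    y = [0, 1]
    clf = SGDClassifier(loss="hinge", penalty="l2")
    clf.fit(X, y)
    print(clf.predict([[2., 2.]]))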
@@ -136,7 +136,7 @@ below illustrates the OVA approach on the iris dataset.  The dashed
 lines represent the three OVA classifiers; the background colors show
 the decision surface induced by the three classifiers.
 
-.. figure:: ../auto_examples/linear_model/images/plot_sgd_iris_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_sgd_iris_001.png
    :target: ../auto_examples/linear_model/plot_sgd_iris.html
    :align: center
    :scale: 75
@@ -283,7 +283,7 @@ Different choices for :math:`L` entail different classifiers such as
 All of the above loss functions can be regarded as an upper bound on the
 misclassification error (Zero-one loss) as shown in the Figure below.
 
-.. figure:: ../auto_examples/linear_model/images/plot_sgd_loss_functions_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_sgd_loss_functions_001.png
    :align: center
    :scale: 75
 
@@ -297,7 +297,7 @@ Popular choices for the regularization term :math:`R` include:
 The Figure below shows the contours of the different regularization terms
 in the parameter space when :math:`R(w) = 1`.
 
-.. figure:: ../auto_examples/linear_model/images/plot_sgd_penalties_1.png
+.. figure:: ../auto_examples/linear_model/images/plot_sgd_penalties_001.png
    :align: center
    :scale: 75
 
diff --git a/doc/modules/svm.rst b/doc/modules/svm.rst
index afb3b8e41ee51bade315c3ff868d6bb2d09e8b0c..dc8d3bbaf757a84e5c9033b7bd475c447f99b0fd 100644
--- a/doc/modules/svm.rst
+++ b/doc/modules/svm.rst
@@ -51,7 +51,7 @@ Classification
 capable of performing multi-class classification on a dataset.
 
 
-.. figure:: ../auto_examples/svm/images/plot_iris_1.png
+.. figure:: ../auto_examples/svm/images/plot_iris_001.png
    :target: ../auto_examples/svm/plot_iris.html
    :align: center
 
@@ -243,7 +243,7 @@ classes or certain individual samples keywords ``class_weight`` and
 ``{class_label : value}``, where value is a floating point number > 0
 that sets the parameter ``C`` of class ``class_label`` to ``C * value``.
 
-.. figure:: ../auto_examples/svm/images/plot_separating_hyperplane_unbalanced_1.png
+.. figure:: ../auto_examples/svm/images/plot_separating_hyperplane_unbalanced_001.png
    :target: ../auto_examples/svm/plot_separating_hyperplane_unbalanced.html
    :align: center
    :scale: 75
@@ -255,7 +255,7 @@ that sets the parameter ``C`` of class ``class_label`` to ``C * value``.
 set the parameter ``C`` for the i-th example to ``C * sample_weight[i]``.
 
 
-.. figure:: ../auto_examples/svm/images/plot_weighted_samples_1.png
+.. figure:: ../auto_examples/svm/images/plot_weighted_samples_001.png
    :target: ../auto_examples/svm/plot_weighted_samples.html
    :align: center
    :scale: 75
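A minimal sketch of both keywords on toy data (the weights are arbitrary and only illustrate the syntax)::

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 1.1], [1.4, 1.3]])
    y = np.array([0, 0, 1, 1, 1])

    # class_weight: class 0 is effectively fitted with C * 10
    wclf = SVC(kernel='linear', class_weight={0: 10}).fit(X, y)

    # sample_weight: the i-th sample is fitted with C * sample_weight[i]
    clf = SVC(kernel='linear')
    clf.fit(X, y, sample_weight=[1, 1, 1, 1, 5])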
@@ -325,7 +325,7 @@ will only take as input an array X, as there are no class labels.
 
 See, section :ref:`outlier_detection` for more details on this usage.
 
-.. figure:: ../auto_examples/svm/images/plot_oneclass_1.png
+.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
    :target: ../auto_examples/svm/plot_oneclass.html
    :align: center
    :scale: 75
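A minimal sketch of this label-free usage (the training cloud is synthetic)::

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X_train = 0.3 * rng.randn(100, 2)                # unlabelled "regular" observations

    clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
    clf.fit(X_train)                                  # note: no y is passed
    print(clf.predict([[0.0, 0.0], [4.0, 4.0]]))      # +1 for inliers, -1 for outliers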
@@ -537,7 +537,7 @@ margin), since in general the larger the margin the lower the
 generalization error of the classifier.
 
 
-.. figure:: ../auto_examples/svm/images/plot_separating_hyperplane_1.png
+.. figure:: ../auto_examples/svm/images/plot_separating_hyperplane_001.png
    :align: center
    :scale: 75
 
diff --git a/doc/modules/tree.rst b/doc/modules/tree.rst
index 3e4703351fea8a22a638e48688ac6b5007fb31be..e4c55bc88129a57f72d56ebcb922a8be649f6cf2 100644
--- a/doc/modules/tree.rst
+++ b/doc/modules/tree.rst
@@ -16,7 +16,7 @@ For instance, in the example below, decision trees learn from data to
 approximate a sine curve with a set of if-then-else decision rules. The deeper
 the tree, the more complex the decision rules and the fitter the model.
 
-.. figure:: ../auto_examples/tree/images/plot_tree_regression_1.png
+.. figure:: ../auto_examples/tree/images/plot_tree_regression_001.png
    :target: ../auto_examples/tree/plot_tree_regression.html
    :scale: 75
    :align: center
@@ -160,7 +160,7 @@ After being fitted, the model can then be used to predict new values::
     >>> clf.predict(iris.data[0, :])
     array([0])
 
-.. figure:: ../auto_examples/tree/images/plot_iris_1.png
+.. figure:: ../auto_examples/tree/images/plot_iris_001.png
    :target: ../auto_examples/tree/plot_iris.html
    :align: center
    :scale: 75
@@ -175,7 +175,7 @@ After being fitted, the model can then be used to predict new values::
 Regression
 ==========
 
-.. figure:: ../auto_examples/tree/images/plot_tree_regression_1.png
+.. figure:: ../auto_examples/tree/images/plot_tree_regression_001.png
    :target: ../auto_examples/tree/plot_tree_regression.html
    :scale: 75
    :align: center
@@ -240,7 +240,7 @@ The use of multi-output trees for regression is demonstrated in
 :ref:`example_tree_plot_tree_regression_multioutput.py`. In this example, the input
 X is a single real value and the outputs Y are the sine and cosine of X.
 
-.. figure:: ../auto_examples/tree/images/plot_tree_regression_multioutput_1.png
+.. figure:: ../auto_examples/tree/images/plot_tree_regression_multioutput_001.png
    :target: ../auto_examples/tree/plot_tree_regression_multioutput.html
    :scale: 75
    :align: center
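A minimal sketch of that multi-output setup (noise-free training data, chosen only to keep the example short)::

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = np.sort(5 * rng.rand(100, 1), axis=0)
    y = np.column_stack([np.sin(X).ravel(), np.cos(X).ravel()])   # two outputs per sample

    regr = DecisionTreeRegressor(max_depth=5).fit(X, y)
    print(regr.predict([[2.5]]))    # one row containing the sine and cosine estimates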
@@ -250,7 +250,7 @@ The use of multi-output trees for classification is demonstrated in
 X are the pixels of the upper half of faces and the outputs Y are the pixels of
 the lower half of those faces.
 
-.. figure:: ../auto_examples/images/plot_multioutput_face_completion_1.png
+.. figure:: ../auto_examples/images/plot_multioutput_face_completion_001.png
    :target: ../auto_examples/plot_multioutput_face_completion.html
    :scale: 75
    :align: center
diff --git a/doc/sphinxext/gen_rst.py b/doc/sphinxext/gen_rst.py
index dd1f766acb97aee83291ddd0215c0f40b490f7e4..a7ac4dd2861ca53a80ecae660a5ac0f95995e900 100644
--- a/doc/sphinxext/gen_rst.py
+++ b/doc/sphinxext/gen_rst.py
@@ -415,11 +415,11 @@ SINGLE_IMAGE = """
 # thumbnails for the front page of the scikit-learn home page.
 # key: first image in set
 # values: (number of plot in set, height of thumbnail)
-carousel_thumbs = {'plot_classifier_comparison_1.png': (1, 600),
-                   'plot_outlier_detection_1.png': (3, 372),
-                   'plot_gp_regression_1.png': (2, 250),
-                   'plot_adaboost_twoclass_1.png': (1, 372),
-                   'plot_compare_methods_1.png': (1, 349)}
+carousel_thumbs = {'plot_classifier_comparison_001.png': (1, 600),
+                   'plot_outlier_detection_001.png': (3, 372),
+                   'plot_gp_regression_001.png': (2, 250),
+                   'plot_adaboost_twoclass_001.png': (1, 372),
+                   'plot_compare_methods_001.png': (1, 349)}
 
 
 def extract_docstring(filename, ignore_heading=False):
@@ -883,7 +883,7 @@ def generate_file_rst(fname, target_dir, src_dir, root_dir, plot_gallery):
     """ Generate the rst file for a given example.
     """
     base_image_name = os.path.splitext(fname)[0]
-    image_fname = '%s_%%s.png' % base_image_name
+    image_fname = '%s_%%03d.png' % base_image_name
 
     this_template = rst_template
     last_dir = os.path.split(src_dir)[-1]
@@ -988,12 +988,8 @@ def generate_file_rst(fname, target_dir, src_dir, root_dir, plot_gallery):
             print(" - time elapsed : %.2g sec" % time_elapsed)
         else:
             figure_list = [f[len(image_dir):]
-                            for f in glob.glob(image_path % '[1-9]')]
-                            #for f in glob.glob(image_path % '*')]
-            # Catter for the fact that there can be more than 10 images
-            if len(figure_list) >= 9:
-                figure_list.extend([f[len(image_dir):]
-                            for f in glob.glob(image_path % '1[0-9]')])
+                           for f in glob.glob(image_path.replace("%03d", '*'))]
+        figure_list.sort()
 
         # generate thumb file
         this_template = plot_rst_template
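The effect of the new naming scheme can be sketched in isolation (the ``plot_example`` name is hypothetical): zero-padding makes a plain glob plus a lexicographic sort sufficient, which is why the special-casing of ten or more figures can be dropped::

    import glob

    # the format string built from a base image name, as in generate_file_rst
    image_fname = '%s_%%03d.png' % 'plot_example'
    print(image_fname % 1, image_fname % 12)   # plot_example_001.png plot_example_012.png

    # zero-padded names sort correctly even past nine figures
    names = ['plot_example_010.png', 'plot_example_002.png', 'plot_example_001.png']
    print(sorted(names))

    # deriving the glob pattern from the format string, mirroring the change above
    pattern = ('images/' + image_fname).replace('%03d', '*')
    figure_list = sorted(glob.glob(pattern))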
diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst
index 5451fc7916325a18114bd3ba5c78f0626666d17b..bb6c1fd943f5595e4d7ebbf95dd29a167101e607 100644
--- a/doc/tutorial/basic/tutorial.rst
+++ b/doc/tutorial/basic/tutorial.rst
@@ -189,7 +189,7 @@ which we have not used to train the classifier::
 
 The corresponding image is the following:
 
-.. image:: ../../auto_examples/datasets/images/plot_digits_last_image_1.png
+.. image:: ../../auto_examples/datasets/images/plot_digits_last_image_001.png
     :target: ../../auto_examples/datasets/plot_digits_last_image.html
     :align: center
     :scale: 50
diff --git a/doc/tutorial/statistical_inference/model_selection.rst b/doc/tutorial/statistical_inference/model_selection.rst
index 828143225012a930139ed7ffcacddd779ac06619..76ca0bbfebc4d008e47fa96abcf2af9d48c18898 100644
--- a/doc/tutorial/statistical_inference/model_selection.rst
+++ b/doc/tutorial/statistical_inference/model_selection.rst
@@ -110,7 +110,7 @@ of the computer.
 .. topic:: **Exercise**
    :class: green
 
-   .. image:: ../../auto_examples/exercises/images/plot_cv_digits_1.png
+   .. image:: ../../auto_examples/exercises/images/plot_cv_digits_001.png
         :target: ../../auto_examples/exercises/plot_cv_digits.html
         :align: right
         :scale: 90
diff --git a/doc/tutorial/statistical_inference/putting_together.rst b/doc/tutorial/statistical_inference/putting_together.rst
index 4a1260b1cdabe1ee5eb4efb68bcc8755f97c13f0..eec42d0e7f3c6167fba5bb9488b98b492c9dea28 100644
--- a/doc/tutorial/statistical_inference/putting_together.rst
+++ b/doc/tutorial/statistical_inference/putting_together.rst
@@ -11,7 +11,7 @@ Pipelining
 We have seen that some estimators can transform data and that some estimators
 can predict variables. We can also create combined estimators:
 
-.. image:: ../../auto_examples/images/plot_digits_pipe_1.png
+.. image:: ../../auto_examples/images/plot_digits_pipe_001.png
    :target: ../../auto_examples/plot_digits_pipe.html
    :scale: 65
    :align: right
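A minimal sketch of such a combined estimator, chaining a transformer and a predictor in a :class:`Pipeline` (PCA followed by logistic regression, the two estimators used in the linked example)::

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    digits = load_digits()
    pipe = Pipeline([('pca', PCA(n_components=20)),
                     ('logistic', LogisticRegression())])
    pipe.fit(digits.data, digits.target)
    print(pipe.predict(digits.data[:2]))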
diff --git a/doc/tutorial/statistical_inference/settings.rst b/doc/tutorial/statistical_inference/settings.rst
index 8865212f6149b7f4625f559c567db7ff1a205993..fead00cf952fb1abb409a2700567c191523f994e 100644
--- a/doc/tutorial/statistical_inference/settings.rst
+++ b/doc/tutorial/statistical_inference/settings.rst
@@ -31,7 +31,7 @@ needs to be preprocessed in order to be used by scikit-learn.
 
 .. topic:: An example of reshaping data would be the digits dataset 
 
-    .. image:: ../../auto_examples/datasets/images/plot_digits_last_image_1.png
+    .. image:: ../../auto_examples/datasets/images/plot_digits_last_image_001.png
         :target: ../../auto_examples/datasets/plot_digits_last_image.html
         :align: right
         :scale: 60
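A minimal sketch of that reshaping step::

    from sklearn.datasets import load_digits

    digits = load_digits()
    print(digits.images.shape)      # (1797, 8, 8): each sample is an 8x8 image
    data = digits.images.reshape((digits.images.shape[0], -1))
    print(data.shape)               # (1797, 64): a flat (n_samples, n_features) matrix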
diff --git a/doc/tutorial/statistical_inference/supervised_learning.rst b/doc/tutorial/statistical_inference/supervised_learning.rst
index a4b382d1868764ba99b905952e9e3ba1165b7c5b..7f54a1e92e905fe0f6f270f4318177e05824422f 100644
--- a/doc/tutorial/statistical_inference/supervised_learning.rst
+++ b/doc/tutorial/statistical_inference/supervised_learning.rst
@@ -38,7 +38,7 @@ Nearest neighbor and the curse of dimensionality
 
 .. topic:: Classifying irises:
 
-    .. image:: ../../auto_examples/datasets/images/plot_iris_dataset_1.png
+    .. image:: ../../auto_examples/datasets/images/plot_iris_dataset_001.png
         :target: ../../auto_examples/datasets/plot_iris_dataset.html
         :align: right
 	:scale: 65
@@ -75,7 +75,7 @@ Scikit-learn documentation for more information about this type of classifier.)
 
 **KNN (k nearest neighbors) classification example**:
 
-.. image:: ../../auto_examples/neighbors/images/plot_classification_1.png
+.. image:: ../../auto_examples/neighbors/images/plot_classification_001.png
    :target: ../../auto_examples/neighbors/plot_classification.html
    :align: center
    :scale: 70
@@ -158,7 +158,7 @@ in its simplest form, fits a linear model to the data set by adjusting
 a set of parameters in order to make the sum of the squared residuals
 of the model as small as possible.
 
-.. image:: ../../auto_examples/linear_model/images/plot_ols_1.png
+.. image:: ../../auto_examples/linear_model/images/plot_ols_001.png
    :target: ../../auto_examples/linear_model/plot_ols.html
    :scale: 40
    :align: right
@@ -199,7 +199,7 @@ Shrinkage
 If there are few data points per dimension, noise in the observations
 induces high variance:
 
-.. image:: ../../auto_examples/linear_model/images/plot_ols_ridge_variance_1.png
+.. image:: ../../auto_examples/linear_model/images/plot_ols_ridge_variance_001.png
    :target: ../../auto_examples/linear_model/plot_ols_ridge_variance.html
    :scale: 70
    :align: right
@@ -228,7 +228,7 @@ regression coefficients to zero: any two randomly chosen sets of
 observations are likely to be uncorrelated. This is called :class:`Ridge`
 regression:
 
-.. image:: ../../auto_examples/linear_model/images/plot_ols_ridge_variance_2.png
+.. image:: ../../auto_examples/linear_model/images/plot_ols_ridge_variance_002.png
    :target: ../../auto_examples/linear_model/plot_ols_ridge_variance.html
    :scale: 70
    :align: right
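A minimal sketch contrasting ordinary least squares with ridge regression on a deliberately tiny, noisy problem (the data are invented; only the shrinkage of the coefficient matters)::

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = np.array([[0.5], [1.0]])                    # very few points per dimension
    y = np.array([0.5, 1.0]) + 0.1 * rng.randn(2)   # noisy observations

    print(LinearRegression().fit(X, y).coef_)
    print(Ridge(alpha=0.1).fit(X, y).coef_)          # coefficient shrunk toward zero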
@@ -274,15 +274,15 @@ Sparsity
 ----------
 
 
-.. |diabetes_ols_1| image:: ../../auto_examples/linear_model/images/plot_ols_3d_1.png
+.. |diabetes_ols_1| image:: ../../auto_examples/linear_model/images/plot_ols_3d_001.png
    :target: ../../auto_examples/linear_model/plot_ols_3d.html
    :scale: 65
 
-.. |diabetes_ols_3| image:: ../../auto_examples/linear_model/images/plot_ols_3d_3.png
+.. |diabetes_ols_3| image:: ../../auto_examples/linear_model/images/plot_ols_3d_003.png
    :target: ../../auto_examples/linear_model/plot_ols_3d.html
    :scale: 65
 
-.. |diabetes_ols_2| image:: ../../auto_examples/linear_model/images/plot_ols_3d_2.png
+.. |diabetes_ols_2| image:: ../../auto_examples/linear_model/images/plot_ols_3d_002.png
    :target: ../../auto_examples/linear_model/plot_ols_3d.html
    :scale: 65
 
@@ -349,7 +349,7 @@ application of Occam's razor: *prefer simpler models*.
 Classification
 ---------------
 
-.. image:: ../../auto_examples/linear_model/images/plot_logistic_1.png
+.. image:: ../../auto_examples/linear_model/images/plot_logistic_001.png
    :target: ../../auto_examples/linear_model/plot_logistic.html
    :scale: 65
    :align: right
@@ -375,7 +375,7 @@ function or **logistic** function:
 
 This is known as :class:`LogisticRegression`.
 
-.. image:: ../../auto_examples/linear_model/images/plot_iris_logistic_1.png
+.. image:: ../../auto_examples/linear_model/images/plot_iris_logistic_001.png
    :target: ../../auto_examples/linear_model/plot_iris_logistic.html
    :scale: 83
 
@@ -423,11 +423,11 @@ the separating line (less regularization).
 
 .. currentmodule :: sklearn.svm
 
-.. |svm_margin_unreg| image:: ../../auto_examples/svm/images/plot_svm_margin_1.png
+.. |svm_margin_unreg| image:: ../../auto_examples/svm/images/plot_svm_margin_001.png
    :target: ../../auto_examples/svm/plot_svm_margin.html
    :scale: 70
 
-.. |svm_margin_reg| image:: ../../auto_examples/svm/images/plot_svm_margin_2.png
+.. |svm_margin_reg| image:: ../../auto_examples/svm/images/plot_svm_margin_002.png
    :target: ../../auto_examples/svm/plot_svm_margin.html
    :scale: 70
 
@@ -473,11 +473,11 @@ build a decision function that is not linear but may be polynomial instead.
 This is done using the *kernel trick* that can be seen as
 creating a decision energy by positioning *kernels* on observations:
 
-.. |svm_kernel_linear| image:: ../../auto_examples/svm/images/plot_svm_kernels_1.png
+.. |svm_kernel_linear| image:: ../../auto_examples/svm/images/plot_svm_kernels_001.png
    :target: ../../auto_examples/svm/plot_svm_kernels.html
    :scale: 65
 
-.. |svm_kernel_poly| image:: ../../auto_examples/svm/images/plot_svm_kernels_2.png
+.. |svm_kernel_poly| image:: ../../auto_examples/svm/images/plot_svm_kernels_002.png
    :target: ../../auto_examples/svm/plot_svm_kernels.html
    :scale: 65
 
@@ -515,7 +515,7 @@ creating a decision energy by positioning *kernels* on observations:
 
 
 
-.. |svm_kernel_rbf| image:: ../../auto_examples/svm/images/plot_svm_kernels_3.png
+.. |svm_kernel_rbf| image:: ../../auto_examples/svm/images/plot_svm_kernels_003.png
    :target: ../../auto_examples/svm/plot_svm_kernels.html
    :scale: 65
 
@@ -548,7 +548,7 @@ creating a decision energy by positioning *kernels* on observations:
    ``svm_gui.py``; add data points of both classes with right and left button,
    fit the model and change parameters and data.
 
-.. image:: ../../auto_examples/datasets/images/plot_iris_dataset_1.png
+.. image:: ../../auto_examples/datasets/images/plot_iris_dataset_001.png
     :target: ../../auto_examples/datasets/plot_iris_dataset.html
     :align: right
     :scale: 70
diff --git a/doc/tutorial/statistical_inference/unsupervised_learning.rst b/doc/tutorial/statistical_inference/unsupervised_learning.rst
index 1d281e9a7d26b72a72d45f96ace854febb44eda4..d62c7e50d61bb2c4382c6a3400530b3326285bf4 100644
--- a/doc/tutorial/statistical_inference/unsupervised_learning.rst
+++ b/doc/tutorial/statistical_inference/unsupervised_learning.rst
@@ -24,7 +24,7 @@ Note that there exist a lot of different clustering criteria and associated
 algorithms. The simplest clustering algorithm is 
 :ref:`k_means`.
 
-.. image:: ../../auto_examples/cluster/images/plot_cluster_iris_2.png
+.. image:: ../../auto_examples/cluster/images/plot_cluster_iris_002.png
     :target: ../../auto_examples/cluster/plot_cluster_iris.html
     :scale: 70
     :align: right
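A minimal k-means sketch on the iris measurements (the labels are arbitrary cluster indices, not the true classes)::

    from sklearn import cluster, datasets

    iris = datasets.load_iris()
    k_means = cluster.KMeans(n_clusters=3)
    k_means.fit(iris.data)
    print(k_means.labels_[::10])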
@@ -45,15 +45,15 @@ algorithms. The simplest clustering algorithm is
     >>> print(y_iris[::10])
     [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
 
-.. |k_means_iris_bad_init| image:: ../../auto_examples/cluster/images/plot_cluster_iris_3.png
+.. |k_means_iris_bad_init| image:: ../../auto_examples/cluster/images/plot_cluster_iris_003.png
    :target: ../../auto_examples/cluster/plot_cluster_iris.html
    :scale: 63
 
-.. |k_means_iris_8| image:: ../../auto_examples/cluster/images/plot_cluster_iris_1.png
+.. |k_means_iris_8| image:: ../../auto_examples/cluster/images/plot_cluster_iris_001.png
    :target: ../../auto_examples/cluster/plot_cluster_iris.html
    :scale: 63
 
-.. |cluster_iris_truth| image:: ../../auto_examples/cluster/images/plot_cluster_iris_4.png
+.. |cluster_iris_truth| image:: ../../auto_examples/cluster/images/plot_cluster_iris_004.png
    :target: ../../auto_examples/cluster/plot_cluster_iris.html
    :scale: 63
 
@@ -85,19 +85,19 @@ algorithms. The simplest clustering algorithm is
 
     **Don't over-interpret clustering results**
 
-.. |lena| image:: ../../auto_examples/cluster/images/plot_lena_compress_1.png
+.. |lena| image:: ../../auto_examples/cluster/images/plot_lena_compress_001.png
    :target: ../../auto_examples/cluster/plot_lena_compress.html
    :scale: 60
 
-.. |lena_regular| image:: ../../auto_examples/cluster/images/plot_lena_compress_2.png
+.. |lena_regular| image:: ../../auto_examples/cluster/images/plot_lena_compress_002.png
    :target: ../../auto_examples/cluster/plot_lena_compress.html
    :scale: 60
 
-.. |lena_compressed| image:: ../../auto_examples/cluster/images/plot_lena_compress_3.png
+.. |lena_compressed| image:: ../../auto_examples/cluster/images/plot_lena_compress_003.png
    :target: ../../auto_examples/cluster/plot_lena_compress.html
    :scale: 60
 
-.. |lena_histogram| image:: ../../auto_examples/cluster/images/plot_lena_compress_4.png
+.. |lena_histogram| image:: ../../auto_examples/cluster/images/plot_lena_compress_004.png
    :target: ../../auto_examples/cluster/plot_lena_compress.html
    :scale: 60
 
@@ -177,7 +177,7 @@ This can be useful, for instance, to retrieve connected regions (sometimes
 also referred to as connected components) when
 clustering an image:
 
-.. image:: ../../auto_examples/cluster/images/plot_lena_ward_segmentation_1.png
+.. image:: ../../auto_examples/cluster/images/plot_lena_ward_segmentation_001.png
     :target: ../../auto_examples/cluster/plot_lena_ward_segmentation.html
     :scale: 40
     :align: right
@@ -200,7 +200,7 @@ features: **feature agglomeration**. This approach can be implemented by
 clustering in the feature direction, in other words clustering the
 transposed data.
 
-.. image:: ../../auto_examples/cluster/images/plot_digits_agglomeration_1.png
+.. image:: ../../auto_examples/cluster/images/plot_digits_agglomeration_001.png
     :target: ../../auto_examples/cluster/plot_digits_agglomeration.html
     :align: right
     :scale: 57
@@ -242,11 +242,11 @@ Principal component analysis: PCA
 :ref:`PCA` selects the successive components that
 explain the maximum variance in the signal.
 
-.. |pca_3d_axis| image:: ../../auto_examples/decomposition/images/plot_pca_3d_1.png
+.. |pca_3d_axis| image:: ../../auto_examples/decomposition/images/plot_pca_3d_001.png
    :target: ../../auto_examples/decomposition/plot_pca_3d.html
    :scale: 70
 
-.. |pca_3d_aligned| image:: ../../auto_examples/decomposition/images/plot_pca_3d_2.png
+.. |pca_3d_aligned| image:: ../../auto_examples/decomposition/images/plot_pca_3d_002.png
    :target: ../../auto_examples/decomposition/plot_pca_3d.html
    :scale: 70
 
@@ -294,7 +294,7 @@ Independent Component Analysis: ICA
 a maximum amount of independent information. It is able to recover
 **non-Gaussian** independent signals:
 
-.. image:: ../../auto_examples/decomposition/images/plot_ica_blind_source_separation_1.png
+.. image:: ../../auto_examples/decomposition/images/plot_ica_blind_source_separation_001.png
    :target: ../../auto_examples/decomposition/plot_ica_blind_source_separation.html
    :scale: 70
    :align: center
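A minimal blind-source-separation sketch: two non-Gaussian sources are mixed and then recovered (up to ordering and scale) with :class:`FastICA`::

    import numpy as np
    from sklearn.decomposition import FastICA

    time = np.linspace(0, 10, 2000)
    s1 = np.sin(2 * time)                        # sinusoidal source
    s2 = np.sign(np.sin(3 * time))               # square-wave source
    S = np.c_[s1, s2]

    A = np.array([[1.0, 1.0], [0.5, 2.0]])       # mixing matrix
    X = np.dot(S, A.T)                           # observed mixtures

    ica = FastICA(n_components=2, random_state=0)
    S_estimated = ica.fit_transform(X)           # recovered sources, up to sign and scale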