diff --git a/doc/datasets/twenty_newsgroups.inc b/doc/datasets/twenty_newsgroups.inc
index 99d1f613d5b4d86f5adb784a3ef234e7953ccd71..8d41f4a0aac8ab88755b35506a98c83e7cf24133 100644
--- a/doc/datasets/twenty_newsgroups.inc
+++ b/doc/datasets/twenty_newsgroups.inc
@@ -1,3 +1,5 @@
+.. _20newsgroups:
+
 The 20 newsgroups text dataset
 ==============================
diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst
index e446eed81d1ff9e2024231f95d0bd6c048da3883..efe86133688a1874d1296684d81a7bfed3352f4b 100644
--- a/doc/modules/classes.rst
+++ b/doc/modules/classes.rst
@@ -273,6 +273,8 @@ From images
 
    feature_extraction.image.PatchExtractor
 
+.. _text_feature_extraction_ref:
+
 From text
 ---------
 
@@ -286,9 +288,6 @@ From text
    :toctree: generated/
    :template: class.rst
 
-   feature_extraction.text.RomanPreprocessor
-   feature_extraction.text.WordNGramAnalyzer
-   feature_extraction.text.CharNGramAnalyzer
    feature_extraction.text.CountVectorizer
    feature_extraction.text.TfidfTransformer
    feature_extraction.text.Vectorizer
diff --git a/doc/modules/feature_extraction.rst b/doc/modules/feature_extraction.rst
index afa09358b4858c7da4ab1b52e88c20aec9283759..c443069afb4dc44a15cc0a06392f4396794535c0 100644
--- a/doc/modules/feature_extraction.rst
+++ b/doc/modules/feature_extraction.rst
@@ -17,7 +17,206 @@ Text feature extraction
 
 .. currentmodule:: sklearn.feature_extraction.text
 
-XXX: a lot to do here
+
+The Bags of Words representation
+--------------------------------
+
+Natural Language Processing is a major application field for machine
+learning algorithms. However, the raw data, a sequence of symbols, cannot
+be fed directly to the algorithms themselves, as most of them expect
+numerical feature vectors of a fixed size rather than raw text documents
+of variable length.
+
+In order to address this, scikit-learn provides utilities for the most
+common ways to extract numerical features from text content, namely:
+
+- **tokenizing** strings and giving an integer id to each possible token,
+  for instance by using whitespace and punctuation as token separators.
+
+- **counting** the occurrences of tokens in each document.
+
+- **normalizing** and weighting with diminishing importance tokens that
+  occur in the majority of samples / documents.
+
+In this scheme, features and samples are defined as follows:
+
+- each **individual token occurrence frequency** (normalized or not)
+  is treated as a **feature**.
+
+- the vector of all the token frequencies for a given **document** is
+  considered a multivariate **sample**.
+
+A corpus of documents can thus be represented by a matrix with one row
+per document and one column per token (e.g. word) occurring in the corpus.
+
+We call **vectorization** the general process of turning a collection of
+text documents into numerical feature vectors. This specific strategy
+(tokenization, counting and normalization) is called the **Bags of
+Words** representation, as documents are described by word occurrences
+while completely ignoring the relative position information of the words
+in the document.
+
+
+Sparsity
+--------
+
+As most documents will typically use only a very small subset of the
+words used in the corpus, the resulting matrix will have many feature
+values that are zeros (typically more than 99% of them).
+
+For instance, a collection of 10,000 short text documents (such as emails)
+will use a vocabulary with a size on the order of 100,000 unique words in
+total, while each individual document will use 100 to 1000 unique words.
+
+In order to be able to store such a matrix in memory, but also to speed
+up algebraic matrix / vector operations, implementations will typically
+use a sparse representation such as the ones available in the
+``scipy.sparse`` package.
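+
+As a purely illustrative sketch of such a representation (this snippet
+only relies on the ``scipy.sparse`` package itself, not on any
+scikit-learn specific behavior), a compressed sparse row matrix keeps
+track of the non-zero entries only::
+
+  >>> import numpy as np
+  >>> from scipy import sparse
+  >>> dense = np.array([[0, 0, 3], [4, 0, 0]])  # a mostly-zero toy matrix
+  >>> X_sparse = sparse.csr_matrix(dense)       # compressed sparse row storage
+  >>> X_sparse.nnz                              # only 2 values are actually stored
+  2
+  >>> X_sparse.toarray()                        # densify back to a regular numpy array
+  array([[0, 0, 3],
+         [4, 0, 0]])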
+
+
+Common Vectorizer usage
+-----------------------
+
+:class:`CountVectorizer` implements both tokenization and occurrence
+counting in a single class::
+
+  >>> from sklearn.feature_extraction.text import CountVectorizer
+
+This class has many parameters; however, the default values are quite
+reasonable (please refer to the :ref:`reference documentation
+<text_feature_extraction_ref>` for the details)::
+
+  >>> vectorizer = CountVectorizer()
+  >>> vectorizer
+  CountVectorizer(binary=False, charset='utf-8', charset_error='strict',
+          dtype=<type 'long'>, fixed_vocabulary=None, input='content',
+          lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
+          stop_words=None, strip_accents='ascii', strip_tags=False,
+          token_pattern=u'\\b\\w\\w+\\b', tokenize='word')
+
+Let's use it to tokenize and count the word occurrences of a minimalistic
+corpus of text documents::
+
+  >>> corpus = [
+  ...     'This is the first document.',
+  ...     'This is the second second document.',
+  ...     'And the third one.',
+  ...     'Is this the first document?',
+  ... ]
+  >>> X = vectorizer.fit_transform(corpus)
+  >>> X  # doctest: +NORMALIZE_WHITESPACE
+  <4x9 sparse matrix of type '<type 'numpy.int64'>'
+      with 19 stored elements in COOrdinate format>
+
+The default configuration tokenizes the string by extracting words of
+at least 2 letters. The specific function that does this step can be
+requested explicitly::
+
+  >>> analyze = vectorizer.build_analyzer()
+  >>> analyze("This is a text document to analyze.")
+  ['this', 'is', 'text', 'document', 'to', 'analyze']
+
+Each term found by the analyzer during the fit is assigned a unique
+integer index that corresponds to a column in the resulting matrix. This
+interpretation of the columns can be retrieved as follows::
+
+  >>> list(vectorizer.get_feature_names())
+  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
+
+  >>> X.toarray()
+  array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
+         [0, 1, 0, 1, 0, 2, 1, 0, 1],
+         [1, 0, 0, 0, 1, 0, 1, 1, 0],
+         [0, 1, 1, 1, 0, 0, 1, 0, 1]])
+
+The converse mapping from feature name to column index is stored in the
+``vocabulary_`` attribute of the vectorizer::
+
+  >>> vectorizer.vocabulary_.get('document')
+  1
+
+Hence words that were not seen in the training corpus will be completely
+ignored in future calls to the transform method::
+
+  >>> vectorizer.transform(['Something completely new.']).toarray()
+  array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])
+
+Note that in the previous corpus, the first and the last documents have
+exactly the same words and hence are encoded as equal vectors. In
+particular, we lose the information that the last document is an
+interrogative form. To preserve some of the local ordering information,
+we can extract 2-grams of words in addition to the 1-grams (the words
+themselves)::
+
+  >>> bigram_vectorizer = CountVectorizer(min_n=1, max_n=2,
+  ...                                     token_pattern=ur'\b\w+\b')
+  >>> analyze = bigram_vectorizer.build_analyzer()
+  >>> analyze('Bi-grams are cool!')
+  [u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']
+
+The vocabulary extracted by this vectorizer is hence much bigger and
+can now resolve ambiguities encoded in local positioning patterns::
+
+  >>> X_2 = bigram_vectorizer.fit_transform(corpus)
+  >>> X_2.toarray()
+  array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
+         [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
+         [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
+         [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]])
+
+In particular, the interrogative form "Is this" is only present in the
+last document::
+
+  >>> bigram_vectorizer.vocabulary_.get(u'is this')
+  7
+
+
+TF-IDF normalization
+--------------------
+
+In a large text corpus, some words are very frequent (e.g. "the", "a",
+"is" in English) and hence carry very little meaningful information
+about the actual contents of a single document. The tf-idf transform
+(term frequency times inverse document frequency) re-weights the raw
+occurrence counts so that tokens occurring in most documents of the
+corpus are given less importance than rarer, more informative tokens.
+This re-weighting is implemented by :class:`TfidfTransformer`.
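+
+The following is a minimal sketch of this weighting step (it reuses the
+count matrix ``X`` computed on ``corpus`` above and relies on the default
+parameters of :class:`TfidfTransformer`; the re-weighted values themselves
+are not shown here since they depend on the chosen normalization options)::
+
+  >>> from sklearn.feature_extraction.text import TfidfTransformer
+  >>> transformer = TfidfTransformer()
+  >>> tfidf = transformer.fit_transform(X)  # re-weight the raw counts from above
+  >>> tfidf.shape                           # one row per document, one column per token
+  (4, 9)
+
+The resulting matrix has the same shape as the count matrix: only the
+values change, from integer counts to floating point weights.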
+
+Note, however, that tf-idf weighting is not always beneficial: on very
+short texts, binary occurrence markers (see the ``binary`` parameter of
+:class:`CountVectorizer`) often work better than weighted counts.
+
+TODO: Point to grid search example.
+
+
+Applications and examples
+-------------------------
+
+The bag of words representation is quite simplistic but surprisingly
+useful in practice.
+
+In particular, in a **supervised setting** it can be successfully combined
+with fast and scalable linear models to train **document classifiers**,
+for instance:
+
+  * :ref:`example_document_classification_20newsgroups.py`
+
+In an **unsupervised setting** it can be used to group similar documents
+together by applying clustering algorithms such as :ref:`k_means`:
+
+  * :ref:`example_document_clustering.py`
+
+Finally, it is possible to discover the main topics of a corpus by
+relaxing the hard assignment constraint of clustering, for instance by
+using :ref:`NMF`:
+
+  * :ref:`example_applications_topics_extraction_with_nmf.py`
+
+
+Limitations of the Bag of Words representation
+----------------------------------------------
+
+While some local positioning information can be preserved by extracting
+n-grams instead of individual words, Bag of Words and Bag of n-grams
+destroy most of the inner structure of the document and hence most of
+the meaning carried by that internal structure.
+
+In order to address the wider task of Natural Language Understanding,
+the local structure of sentences and paragraphs should thus be taken
+into account. Many such models will thus be cast as "Structured Output"
+problems, which are currently outside of the scope of scikit-learn.
 
 Image feature extraction