diff --git a/doc/datasets/twenty_newsgroups.inc b/doc/datasets/twenty_newsgroups.inc
index 99d1f613d5b4d86f5adb784a3ef234e7953ccd71..8d41f4a0aac8ab88755b35506a98c83e7cf24133 100644
--- a/doc/datasets/twenty_newsgroups.inc
+++ b/doc/datasets/twenty_newsgroups.inc
@@ -1,3 +1,5 @@
+.. _20newsgroups:
+
 The 20 newsgroups text dataset
 ==============================
diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst
index e446eed81d1ff9e2024231f95d0bd6c048da3883..efe86133688a1874d1296684d81a7bfed3352f4b 100644
--- a/doc/modules/classes.rst
+++ b/doc/modules/classes.rst
@@ -273,6 +273,8 @@ From images
 
    feature_extraction.image.PatchExtractor
 
+.. _text_feature_extraction_ref:
+
 From text
 ---------
 
@@ -286,9 +288,6 @@ From text
    :toctree: generated/
    :template: class.rst
 
-   feature_extraction.text.RomanPreprocessor
-   feature_extraction.text.WordNGramAnalyzer
-   feature_extraction.text.CharNGramAnalyzer
    feature_extraction.text.CountVectorizer
    feature_extraction.text.TfidfTransformer
    feature_extraction.text.Vectorizer
diff --git a/doc/modules/feature_extraction.rst b/doc/modules/feature_extraction.rst
index afa09358b4858c7da4ab1b52e88c20aec9283759..c443069afb4dc44a15cc0a06392f4396794535c0 100644
--- a/doc/modules/feature_extraction.rst
+++ b/doc/modules/feature_extraction.rst
@@ -17,7 +17,206 @@ Text feature extraction
 
 .. currentmodule:: sklearn.feature_extraction.text
 
-XXX: a lot to do here
+
+The Bags of Words representation
+--------------------------------
+
+Natural Language Processing is a major application field for machine
+learning algorithms. However, the raw data, a sequence of symbols, cannot
+be fed directly to the algorithms themselves, as most of them expect
+numerical feature vectors of a fixed size rather than raw text documents
+of variable length.
+
+In order to address this, scikit-learn provides utilities for the most
+common ways to extract numerical features from text content, namely:
+
+- **tokenizing** strings and giving an integer id to each possible token,
+  for instance by using whitespace and punctuation as token separators.
+
+- **counting** the occurrences of tokens in each document.
+
+- **normalizing** and weighting with diminishing importance tokens that
+  occur in the majority of samples / documents.
+
+In this scheme, features and samples are defined as follows:
+
+- each **individual token occurrence frequency** (normalized or not)
+  is treated as a **feature**.
+
+- the vector of all the token frequencies for a given **document** is
+  considered a multivariate **sample**.
+
+A corpus of documents can thus be represented by a matrix with one row
+per document and one column per token (e.g. word) occurring in the corpus.
+
+We call **vectorization** the general process of turning a collection of
+text documents into numerical feature vectors. This specific strategy
+(tokenization, counting and normalization) is called the **Bags of
+Words** representation, as documents are described by word occurrences
+while completely ignoring the relative position information of the words
+in the document.
+
+
+Sparsity
+--------
+
+As most documents will typically use only a very small subset of the
+words used in the corpus, the resulting matrix will have many feature
+values that are zeros (typically more than 99% of them).
+
+For instance, a collection of 10,000 short text documents (such as emails)
+will use a vocabulary with a size on the order of 100,000 unique words in
+total, while each individual document will use 100 to 1000 unique words.
+
+In order to be able to store such a matrix in memory, but also to speed
+up algebraic matrix / vector operations, implementations will typically
+use a sparse representation such as the ones available in the
+``scipy.sparse`` package.
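+
+As a purely illustrative sketch of such a representation (this snippet
+only relies on the ``scipy.sparse`` package itself, not on any
+scikit-learn specific behavior), a compressed sparse row matrix keeps
+track of the non-zero entries only::
+
+  >>> import numpy as np
+  >>> from scipy import sparse
+  >>> dense = np.array([[0, 0, 3], [4, 0, 0]])  # a mostly-zero toy matrix
+  >>> X_sparse = sparse.csr_matrix(dense)       # compressed sparse row storage
+  >>> X_sparse.nnz                              # only 2 values are actually stored
+  2
+  >>> X_sparse.toarray()                        # densify back to a regular numpy array
+  array([[0, 0, 3],
+         [4, 0, 0]])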
+
+
+Common Vectorizer usage
+-----------------------
+
+:class:`CountVectorizer` implements both tokenization and occurrence
+counting in a single class::
+
+  >>> from sklearn.feature_extraction.text import CountVectorizer
+
+This class has many parameters; however, the default values are quite
+reasonable (please refer to the :ref:`reference documentation
+<text_feature_extraction_ref>` for the details)::
+
+  >>> vectorizer = CountVectorizer()
+  >>> vectorizer
+  CountVectorizer(binary=False, charset='utf-8', charset_error='strict',
+          dtype=<type 'long'>, fixed_vocabulary=None, input='content',
+          lowercase=True, max_df=1.0, max_features=None, max_n=1, min_n=1,
+          stop_words=None, strip_accents='ascii', strip_tags=False,
+          token_pattern=u'\\b\\w\\w+\\b', tokenize='word')
+
+Let's use it to tokenize and count the word occurrences of a minimalistic
+corpus of text documents::
+
+  >>> corpus = [
+  ...     'This is the first document.',
+  ...     'This is the second second document.',
+  ...     'And the third one.',
+  ...     'Is this the first document?',
+  ... ]
+  >>> X = vectorizer.fit_transform(corpus)
+  >>> X  # doctest: +NORMALIZE_WHITESPACE
+  <4x9 sparse matrix of type '<type 'numpy.int64'>'
+      with 19 stored elements in COOrdinate format>
+
+The default configuration tokenizes the string by extracting words of
+at least 2 letters. The specific function that does this step can be
+requested explicitly::
+
+  >>> analyze = vectorizer.build_analyzer()
+  >>> analyze("This is a text document to analyze.")
+  ['this', 'is', 'text', 'document', 'to', 'analyze']
+
+Each term found by the analyzer during the fit is assigned a unique
+integer index that corresponds to a column in the resulting matrix. This
+interpretation of the columns can be retrieved as follows::
+
+  >>> list(vectorizer.get_feature_names())
+  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
+
+  >>> X.toarray()
+  array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
+         [0, 1, 0, 1, 0, 2, 1, 0, 1],
+         [1, 0, 0, 0, 1, 0, 1, 1, 0],
+         [0, 1, 1, 1, 0, 0, 1, 0, 1]])
+
+The converse mapping from feature name to column index is stored in the
+``vocabulary_`` attribute of the vectorizer::
+
+  >>> vectorizer.vocabulary_.get('document')
+  1
+
+Hence words that were not seen in the training corpus will be completely
+ignored in future calls to the transform method::
+
+  >>> vectorizer.transform(['Something completely new.']).toarray()
+  array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])
+
+Note that in the previous corpus, the first and the last documents have
+exactly the same words and hence are encoded as equal vectors. In
+particular, we lose the information that the last document is an
+interrogative form. To preserve some of the local ordering information,
+we can extract 2-grams of words in addition to the 1-grams (the words
+themselves)::
+
+  >>> bigram_vectorizer = CountVectorizer(min_n=1, max_n=2,
+  ...                                     token_pattern=ur'\b\w+\b')
+  >>> analyze = bigram_vectorizer.build_analyzer()
+  >>> analyze('Bi-grams are cool!')
+  [u'bi', u'grams', u'are', u'cool', u'bi grams', u'grams are', u'are cool']
+
+The vocabulary extracted by this vectorizer is hence much bigger and
+can now resolve ambiguities encoded in local positioning patterns::
+
+  >>> X_2 = bigram_vectorizer.fit_transform(corpus)
+  >>> X_2.toarray()
+  array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
+         [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
+         [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
+         [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]])
+
+In particular, the interrogative form "Is this" is only present in the
+last document::
+
+  >>> bigram_vectorizer.vocabulary_.get(u'is this')
+  7
+
+
+TF-IDF normalization
+--------------------
+
+In a large text corpus, some words are very frequent (e.g. "the", "a",
+"is" in English) and hence carry very little meaningful information
+about the actual contents of a single document. The tf-idf transform
+(term frequency times inverse document frequency) re-weights the raw
+occurrence counts so that tokens occurring in most documents of the
+corpus are given less importance than rarer, more informative tokens.
+This re-weighting is implemented by :class:`TfidfTransformer`.
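+
+The following is a minimal sketch of this weighting step (it reuses the
+count matrix ``X`` computed on ``corpus`` above and relies on the default
+parameters of :class:`TfidfTransformer`; the re-weighted values themselves
+are not shown here since they depend on the chosen normalization options)::
+
+  >>> from sklearn.feature_extraction.text import TfidfTransformer
+  >>> transformer = TfidfTransformer()
+  >>> tfidf = transformer.fit_transform(X)  # re-weight the raw counts from above
+  >>> tfidf.shape                           # one row per document, one column per token
+  (4, 9)
+
+The resulting matrix has the same shape as the count matrix: only the
+values change, from integer counts to floating point weights.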
+
+Note, however, that tf-idf weighting is not always beneficial: on very
+short texts, binary occurrence markers (see the ``binary`` parameter of
+:class:`CountVectorizer`) often work better than weighted counts.
+
+TODO: Point to grid search example.
+
+
+Applications and examples
+-------------------------
+
+The bag of words representation is quite simplistic but surprisingly
+useful in practice.
+
+In particular, in a **supervised setting** it can be successfully combined
+with fast and scalable linear models to train **document classifiers**,
+for instance:
+
+  * :ref:`example_document_classification_20newsgroups.py`
+
+In an **unsupervised setting** it can be used to group similar documents
+together by applying clustering algorithms such as :ref:`k_means`:
+
+  * :ref:`example_document_clustering.py`
+
+Finally, it is possible to discover the main topics of a corpus by
+relaxing the hard assignment constraint of clustering, for instance by
+using :ref:`NMF`:
+
+  * :ref:`example_applications_topics_extraction_with_nmf.py`
+
+
+Limitations of the Bag of Words representation
+----------------------------------------------
+
+While some local positioning information can be preserved by extracting
+n-grams instead of individual words, Bag of Words and Bag of n-grams
+destroy most of the inner structure of the document and hence most of
+the meaning carried by that internal structure.
+
+In order to address the wider task of Natural Language Understanding,
+the local structure of sentences and paragraphs should thus be taken
+into account. Many such models will thus be cast as "Structured Output"
+problems, which are currently outside of the scope of scikit-learn.
 
 Image feature extraction