From a75949a22cf2223bac720fe14372f35e524c67e1 Mon Sep 17 00:00:00 2001
From: Lars Buitinck <larsmans@gmail.com>
Date: Mon, 22 Jul 2013 12:46:30 +0200
Subject: [PATCH] DOC 20news filtering with smaller set and MultinomialNB

BernoulliNB in its present form has a hard time showing the good features,
because it isn't really a linear model (XXX fix this). A smaller test makes
this easier to reproduce for users.
---
 doc/datasets/twenty_newsgroups.rst | 81 ++++++++++++++-----------------
 1 file changed, 36 insertions(+), 45 deletions(-)

diff --git a/doc/datasets/twenty_newsgroups.rst b/doc/datasets/twenty_newsgroups.rst
index 79990ebbb9..2da149b10d 100644
--- a/doc/datasets/twenty_newsgroups.rst
+++ b/doc/datasets/twenty_newsgroups.rst
@@ -86,21 +86,25 @@ In order to feed predictive or clustering models with the text data, one
 first need to turn the text into vectors of numerical values suitable for
 statistical analysis. This can be achieved with the utilities of the
 ``sklearn.feature_extraction.text`` as demonstrated in the following
-example that extract `TF-IDF`_ vectors of unigram tokens::
+example that extracts `TF-IDF`_ vectors of unigram tokens
+from a subset of 20news::
 
   >>> from sklearn.feature_extraction.text import TfidfVectorizer
-  >>> newsgroups_train = fetch_20newsgroups(subset='train')
+  >>> categories = ['alt.atheism', 'talk.religion.misc',
+  ...               'comp.graphics', 'sci.space']
+  >>> newsgroups_train = fetch_20newsgroups(subset='train',
+  ...                                       categories=categories)
   >>> vectorizer = TfidfVectorizer()
   >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
   >>> vectors.shape
-  (11314, 129792)
+  (2034, 34118)
 
-The extracted TF-IDF vectors are very sparse, with an average of 110 non zero
-components by sample in a more than 20000 dimensional space (less than 1% non
-zero features)::
+The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero
+components per sample in a more than 30000-dimensional space
+(less than .5% non-zero features)::
 
-  >>> vectors.nnz / vectors.shape[0]
-  110
+  >>> vectors.nnz / float(vectors.shape[0])
+  159.01327433628319
 
 ``sklearn.datasets.fetch_20newsgroups_vectorized`` is a function which returns
 ready-to-use tfidf features instead of file names.
@@ -116,22 +120,23 @@ It is easy for a classifier to overfit on particular things that appear in the
 high F-scores, but their results would not generalize to other documents that
 aren't from this window of time.
 
-For example, let's look at the results of a Bernoulli Naive Bayes classifier,
+For example, let's look at the results of a multinomial Naive Bayes classifier,
 which is fast to train and achieves a decent F-score::
 
-  >>> from sklearn.naive_bayes import BernoulliNB
+  >>> from sklearn.naive_bayes import MultinomialNB
   >>> from sklearn import metrics
-  >>> newsgroups_test = fetch_20newsgroups(subset='test')
+  >>> newsgroups_test = fetch_20newsgroups(subset='test',
+  ...                                      categories=categories)
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
-  >>> clf = BernoulliNB(alpha=.01)
+  >>> clf = MultinomialNB(alpha=.01)
   >>> clf.fit(vectors, newsgroups_train.target)
   >>> pred = clf.predict(vectors_test)
-  >>> metrics.f1_score(pred, newsgroups_test.target)
-  0.78117467868044399
+  >>> metrics.f1_score(newsgroups_test.target, pred)
+  0.88251152461278892
 
 (The example :ref:`example_document_classification_20newsgroups.py` shuffles
 the training and test data, instead of segmenting by time, and in that case
-Bernoulli Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
+multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
 yet of what's going on inside this classifier?)
 
 Let's take a look at what the most informative features are:
@@ -144,26 +149,10 @@ Let's take a look at what the most informative features are:
   ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
   ...
   >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
-  alt.atheism: god say think people don com nntp host posting article
-  comp.graphics: like com article know thanks graphics university nntp host posting
-  comp.os.ms-windows.misc: know thanks use com article nntp host posting university windows
-  comp.sys.ibm.pc.hardware: just does article know thanks com university nntp host posting
-  comp.sys.mac.hardware: thanks does apple know article mac university nntp host posting
-  comp.windows.x: like article use reply thanks window com nntp posting host
-  misc.forsale: com usa mail new distribution nntp host posting university sale
-  rec.autos: distribution like just university nntp host posting car article com
-  rec.motorcycles: don just like bike nntp host posting dod com article
-  rec.sport.baseball: think just year com baseball host nntp posting university article
-  rec.sport.hockey: nhl game hockey team nntp host article ca posting university
-  sci.crypt: just nntp host encryption posting article chip key clipper com
-  sci.electronics: does like know university use article com nntp host posting
-  sci.med: like reply university nntp don host know posting com article
-  sci.space: nasa like university just com nntp host posting space article
-  soc.religion.christian: just don like rutgers know university article think people god
-  talk.politics.guns: like university don gun people nntp host posting article com
-  talk.politics.mideast: like just nntp host israeli university posting israel people article
-  talk.politics.misc: like university nntp host just don posting people com article
-  talk.religion.misc: think know christian posting god people just don article com
+  alt.atheism: sgi livesey atheists writes people caltech com god keith edu
+  comp.graphics: organization thanks files subject com image lines university edu graphics
+  sci.space: toronto moon gov com alaska access henry nasa edu space
+  talk.religion.misc: article writes kent people christian jesus sandvik edu com god
 
 You can now see many things that these features have overfit to:
 
@@ -188,25 +177,27 @@ of each file. **remove** should be a tuple containing any subset of
 blocks, and quotation blocks respectively.
 
   >>> newsgroups_test = fetch_20newsgroups(subset='test',
-  ...                                      remove=('headers', 'footers', 'quotes'))
+  ...                                      remove=('headers', 'footers', 'quotes'),
+  ...                                      categories=categories)
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> pred = clf.predict(vectors_test)
   >>> metrics.f1_score(pred, newsgroups_test.target)
-  0.51830104911679742
+  0.78409163025839435
 
-This classifier lost over a third of its F-score, just because we removed
-metadata that has little to do with topic classification. It recovers only a
-bit if we also strip this metadata from the training data:
+This classifier lost a lot of its F-score, just because we removed
+metadata that has little to do with topic classification.
+It loses even more if we also strip this metadata from the training data:
 
   >>> newsgroups_train = fetch_20newsgroups(subset='train',
-                                      remove=('headers', 'footers', 'quotes'))
+  ...                                       remove=('headers', 'footers', 'quotes'),
+  ...                                       categories=categories)
   >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
-  >>> clf = BernoulliNB(alpha=.01)
+  >>> clf = MultinomialNB(alpha=.01)
   >>> clf.fit(vectors, newsgroups_train.target)
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> pred = clf.predict(vectors_test)
-  >>> metrics.f1_score(pred, newsgroups_test.target)
-  0.56907392353755404
+  >>> metrics.f1_score(newsgroups_test.target, pred)
+  0.73160869205141166
 
 Some other classifiers cope better with this harder version of the task. Try
 running :ref:`example_grid_search_text_feature_extraction.py` with and without
@@ -214,7 +205,7 @@ the ``--filter`` option to compare the results.
 
 .. topic:: Recommendation
 
-  When evaluating natural language classifiers on the 20 Newsgroups data, you
+  When evaluating text classifiers on the 20 Newsgroups data, you
   should strip newsgroup-related metadata. In scikit-learn, you can do this by
   setting ``remove=('headers', 'footers', 'quotes')``. The F-score will be
   lower because it is more realistic.
--
GitLab
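
Note for readers reproducing the patched example outside the doctests: the
snippet below is a minimal, self-contained sketch of the vectorization step.
The shape (2034, 34118) and the ~159 non-zeros per sample quoted in the patch
were measured against the scikit-learn version current at the time; other
releases and dataset snapshots may give slightly different numbers. ::

  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer

  categories = ['alt.atheism', 'talk.religion.misc',
                'comp.graphics', 'sci.space']
  # Downloads the dataset on first use.
  newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)

  vectorizer = TfidfVectorizer()
  vectors = vectorizer.fit_transform(newsgroups_train.data)

  # Sparsity statistics quoted in the docs above.
  print(vectors.shape)                          # e.g. (2034, 34118)
  print(vectors.nnz / float(vectors.shape[0]))  # average non-zeros per sample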
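
The ``show_top10`` helper in the patched docs indexes ``classifier.coef_``;
later scikit-learn releases removed ``MultinomialNB.coef_``, and
``feature_log_prob_`` carries the equivalent per-class feature weights
(likewise, ``get_feature_names()`` became ``get_feature_names_out()``).
A sketch under those newer-API assumptions::

  import numpy as np
  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB

  categories = ['alt.atheism', 'talk.religion.misc',
                'comp.graphics', 'sci.space']
  train = fetch_20newsgroups(subset='train', categories=categories)

  vectorizer = TfidfVectorizer()
  vectors = vectorizer.fit_transform(train.data)
  clf = MultinomialNB(alpha=.01).fit(vectors, train.target)

  # Ten highest-weighted tokens per class; with unfiltered data these are
  # dominated by header/footer artifacts (edu, com, sender names, ...).
  feature_names = np.asarray(vectorizer.get_feature_names_out())
  for i, category in enumerate(train.target_names):
      top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
      print("%s: %s" % (category, " ".join(feature_names[top10])))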
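
Finally, the before/after comparison that motivates the recommendation, as one
runnable script with a hypothetical ``evaluate`` helper. Newer scikit-learn
versions require an explicit ``average`` argument for multiclass ``f1_score``
(the doctests above predate that); ``average='weighted'`` is assumed here, so
the printed scores may not match the quoted values exactly::

  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn import metrics

  categories = ['alt.atheism', 'talk.religion.misc',
                'comp.graphics', 'sci.space']

  def evaluate(remove=()):
      # Train and test on the same four newsgroups, optionally stripping
      # the headers/footers/quotes metadata from both splits.
      train = fetch_20newsgroups(subset='train', categories=categories,
                                 remove=remove)
      test = fetch_20newsgroups(subset='test', categories=categories,
                                remove=remove)
      vectorizer = TfidfVectorizer()
      vectors = vectorizer.fit_transform(train.data)
      vectors_test = vectorizer.transform(test.data)
      clf = MultinomialNB(alpha=.01).fit(vectors, train.target)
      pred = clf.predict(vectors_test)
      return metrics.f1_score(test.target, pred, average='weighted')

  print("with metadata:    %.3f" % evaluate())
  print("without metadata: %.3f" % evaluate(remove=('headers', 'footers',
                                                    'quotes')))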