From a75949a22cf2223bac720fe14372f35e524c67e1 Mon Sep 17 00:00:00 2001
From: Lars Buitinck <larsmans@gmail.com>
Date: Mon, 22 Jul 2013 12:46:30 +0200
Subject: [PATCH] DOC 20news filtering with smaller set and MultinomialNB

BernoulliNB in its present form has a hard time showing the good features,
because it isn't really a linear model (XXX fix this). A smaller test makes
this easier to reproduce for users.
---
 doc/datasets/twenty_newsgroups.rst | 81 ++++++++++++++-----------------
 1 file changed, 36 insertions(+), 45 deletions(-)

diff --git a/doc/datasets/twenty_newsgroups.rst b/doc/datasets/twenty_newsgroups.rst
index 79990ebbb9..2da149b10d 100644
--- a/doc/datasets/twenty_newsgroups.rst
+++ b/doc/datasets/twenty_newsgroups.rst
@@ -86,21 +86,25 @@ In order to feed predictive or clustering models with the text data, one
 first need to turn the text into vectors of numerical values suitable for
 statistical analysis. This can be achieved with the utilities of the
 ``sklearn.feature_extraction.text`` as demonstrated in the following
-example that extract `TF-IDF`_ vectors of unigram tokens::
+example that extracts `TF-IDF`_ vectors of unigram tokens
+from a subset of 20news::
 
   >>> from sklearn.feature_extraction.text import TfidfVectorizer
-  >>> newsgroups_train = fetch_20newsgroups(subset='train')
+  >>> categories = ['alt.atheism', 'talk.religion.misc',
+  ...               'comp.graphics', 'sci.space']
+  >>> newsgroups_train = fetch_20newsgroups(subset='train',
+  ...                                       categories=categories)
   >>> vectorizer = TfidfVectorizer()
   >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
   >>> vectors.shape
-  (11314, 129792)
+  (2034, 34118)
 
-The extracted TF-IDF vectors are very sparse, with an average of 110 non zero
-components by sample in a more than 20000 dimensional space (less than 1% non
-zero features)::
+The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero
+components per sample in a more than 30000-dimensional space
+(less than .5% non-zero features)::
 
-  >>> vectors.nnz / vectors.shape[0]
-  110
+  >>> vectors.nnz / float(vectors.shape[0])
+  159.01327433628319
 
 ``sklearn.datasets.fetch_20newsgroups_vectorized`` is a function which returns
 ready-to-use tfidf features instead of file names.
@@ -116,22 +120,23 @@ It is easy for a classifier to overfit on particular things that appear in the
 high F-scores, but their results would not generalize to other documents that
 aren't from this window of time.
 
-For example, let's look at the results of a Bernoulli Naive Bayes classifier,
+For example, let's look at the results of a multinomial Naive Bayes classifier,
 which is fast to train and achieves a decent F-score::
 
-  >>> from sklearn.naive_bayes import BernoulliNB
+  >>> from sklearn.naive_bayes import MultinomialNB
   >>> from sklearn import metrics
-  >>> newsgroups_test = fetch_20newsgroups(subset='test')
+  >>> newsgroups_test = fetch_20newsgroups(subset='test',
+  ...                                      categories=categories)
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
-  >>> clf = BernoulliNB(alpha=.01)
+  >>> clf = MultinomialNB(alpha=.01)
   >>> clf.fit(vectors, newsgroups_train.target)
   >>> pred = clf.predict(vectors_test)
-  >>> metrics.f1_score(pred, newsgroups_test.target)
-  0.78117467868044399
+  >>> metrics.f1_score(newsgroups_test.target, pred)
+  0.88251152461278892
 
 (The example :ref:`example_document_classification_20newsgroups.py` shuffles
 the training and test data, instead of segmenting by time, and in that case
-Bernoulli Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
+multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
 yet of what's going on inside this classifier?)
 
 Let's take a look at what the most informative features are:
@@ -144,26 +149,10 @@ Let's take a look at what the most informative features are:
   ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
   ...
   >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
-  alt.atheism: god say think people don com nntp host posting article
-  comp.graphics: like com article know thanks graphics university nntp host posting
-  comp.os.ms-windows.misc: know thanks use com article nntp host posting university windows
-  comp.sys.ibm.pc.hardware: just does article know thanks com university nntp host posting
-  comp.sys.mac.hardware: thanks does apple know article mac university nntp host posting
-  comp.windows.x: like article use reply thanks window com nntp posting host
-  misc.forsale: com usa mail new distribution nntp host posting university sale
-  rec.autos: distribution like just university nntp host posting car article com
-  rec.motorcycles: don just like bike nntp host posting dod com article
-  rec.sport.baseball: think just year com baseball host nntp posting university article
-  rec.sport.hockey: nhl game hockey team nntp host article ca posting university
-  sci.crypt: just nntp host encryption posting article chip key clipper com
-  sci.electronics: does like know university use article com nntp host posting
-  sci.med: like reply university nntp don host know posting com article
-  sci.space: nasa like university just com nntp host posting space article
-  soc.religion.christian: just don like rutgers know university article think people god
-  talk.politics.guns: like university don gun people nntp host posting article com
-  talk.politics.mideast: like just nntp host israeli university posting israel people article
-  talk.politics.misc: like university nntp host just don posting people com article
-  talk.religion.misc: think know christian posting god people just don article com
+  alt.atheism: sgi livesey atheists writes people caltech com god keith edu
+  comp.graphics: organization thanks files subject com image lines university edu graphics
+  sci.space: toronto moon gov com alaska access henry nasa edu space
+  talk.religion.misc: article writes kent people christian jesus sandvik edu com god
 
 You can now see many things that these features have overfit to:
 
@@ -188,25 +177,27 @@ of each file. **remove** should be a tuple containing any subset of
 blocks, and quotation blocks respectively.
 
   >>> newsgroups_test = fetch_20newsgroups(subset='test',
-  ...                                      remove=('headers', 'footers', 'quotes'))
+  ...                                      remove=('headers', 'footers', 'quotes'),
+  ...                                      categories=categories)
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> pred = clf.predict(vectors_test)
   >>> metrics.f1_score(pred, newsgroups_test.target)
-  0.51830104911679742
+  0.78409163025839435
 
-This classifier lost over a third of its F-score, just because we removed
-metadata that has little to do with topic classification. It recovers only a
-bit if we also strip this metadata from the training data:
+This classifier lost a lot of its F-score, just because we removed
+metadata that has little to do with topic classification.
+It loses even more if we also strip this metadata from the training data:
 
   >>> newsgroups_train = fetch_20newsgroups(subset='train',
-                                      remove=('headers', 'footers', 'quotes'))
+  ...                                       remove=('headers', 'footers', 'quotes'),
+  ...                                       categories=categories)
   >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
-  >>> clf = BernoulliNB(alpha=.01)
+  >>> clf = MultinomialNB(alpha=.01)
   >>> clf.fit(vectors, newsgroups_train.target)
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> pred = clf.predict(vectors_test)
-  >>> metrics.f1_score(pred, newsgroups_test.target)
-  0.56907392353755404
+  >>> metrics.f1_score(newsgroups_test.target, pred)
+  0.73160869205141166
 
 Some other classifiers cope better with this harder version of the task. Try
 running :ref:`example_grid_search_text_feature_extraction.py` with and without
@@ -214,7 +205,7 @@ the ``--filter`` option to compare the results.
 
 .. topic:: Recommendation
 
-  When evaluating natural language classifiers on the 20 Newsgroups data, you
+  When evaluating text classifiers on the 20 Newsgroups data, you
   should strip newsgroup-related metadata. In scikit-learn, you can do this by
   setting ``remove=('headers', 'footers', 'quotes')``. The F-score will be
   lower because it is more realistic.
--
GitLab
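
Note for readers reproducing the patched example outside the doctests: the
snippet below is a minimal, self-contained sketch of the vectorization step.
The shape (2034, 34118) and the ~159 non-zeros per sample quoted in the patch
were measured against the scikit-learn version current at the time; other
releases and dataset snapshots may give slightly different numbers. ::

  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer

  categories = ['alt.atheism', 'talk.religion.misc',
                'comp.graphics', 'sci.space']
  # Downloads the dataset on first use.
  newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)

  vectorizer = TfidfVectorizer()
  vectors = vectorizer.fit_transform(newsgroups_train.data)

  # Sparsity statistics quoted in the docs above.
  print(vectors.shape)                          # e.g. (2034, 34118)
  print(vectors.nnz / float(vectors.shape[0]))  # average non-zeros per sample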
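
The ``show_top10`` helper in the patched docs indexes ``classifier.coef_``;
later scikit-learn releases removed ``MultinomialNB.coef_``, and
``feature_log_prob_`` carries the equivalent per-class feature weights
(likewise, ``get_feature_names()`` became ``get_feature_names_out()``).
A sketch under those newer-API assumptions::

  import numpy as np
  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB

  categories = ['alt.atheism', 'talk.religion.misc',
                'comp.graphics', 'sci.space']
  train = fetch_20newsgroups(subset='train', categories=categories)

  vectorizer = TfidfVectorizer()
  vectors = vectorizer.fit_transform(train.data)
  clf = MultinomialNB(alpha=.01).fit(vectors, train.target)

  # Ten highest-weighted tokens per class; with unfiltered data these are
  # dominated by header/footer artifacts (edu, com, sender names, ...).
  feature_names = np.asarray(vectorizer.get_feature_names_out())
  for i, category in enumerate(train.target_names):
      top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
      print("%s: %s" % (category, " ".join(feature_names[top10])))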
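
Finally, the before/after comparison that motivates the recommendation, as one
runnable script with a hypothetical ``evaluate`` helper. Newer scikit-learn
versions require an explicit ``average`` argument for multiclass ``f1_score``
(the doctests above predate that); ``average='weighted'`` is assumed here, so
the printed scores may not match the quoted values exactly::

  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn import metrics

  categories = ['alt.atheism', 'talk.religion.misc',
                'comp.graphics', 'sci.space']

  def evaluate(remove=()):
      # Train and test on the same four newsgroups, optionally stripping
      # the headers/footers/quotes metadata from both splits.
      train = fetch_20newsgroups(subset='train', categories=categories,
                                 remove=remove)
      test = fetch_20newsgroups(subset='test', categories=categories,
                                remove=remove)
      vectorizer = TfidfVectorizer()
      vectors = vectorizer.fit_transform(train.data)
      vectors_test = vectorizer.transform(test.data)
      clf = MultinomialNB(alpha=.01).fit(vectors, train.target)
      pred = clf.predict(vectors_test)
      return metrics.f1_score(test.target, pred, average='weighted')

  print("with metadata:    %.3f" % evaluate())
  print("without metadata: %.3f" % evaluate(remove=('headers', 'footers',
                                                    'quotes')))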