ENH Add filters on newsgroup text
It is easy to overfit on the 20newsgroups dataset, by letting classifiers learn from metadata that commonly appears in newsgroup texts, but would be useless for identifying topics outside of this set of newsgroups in 1993.

For example, many classifiers will tell you that three of the most informative features are "nntp", "posting", and "host", because the NNTP-Posting-Host header appears with different frequency in different groups.

The fetch_20newsgroups function now allows you to ask for any of the following kinds of text to be removed:

- Newsgroup headers (which contain lots of NNTP metadata that can identify the group)
- Signature blocks (which often contain multiple terms that uniquely identify the person posting, which in turn identifies the group)
- Quote blocks (which contain people's e-mail addresses and large amounts of text from another post in the same newsgroup)

The 20newsgroups classification example takes the "--filtered" flag, which will remove all of these. This noticeably decreases the accuracy of all classifiers, leaving room for a better method to improve the accuracy.
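To make the three kinds of filtering concrete, here is a minimal, standard-library-only sketch of what stripping headers, quote blocks, and signature blocks from a single post might look like. The helper names, the regular expression, and the sample post are all hypothetical and chosen for illustration; this is not the code added to sklearn/datasets/twenty_newsgroups.py, and the heuristics used there may differ.

```python
import re

# Hypothetical heuristic: a line counts as quoted material if it starts with
# '>' or '|', ends with "writes:" / "wrote:", or begins with "In article".
_QUOTE_RE = re.compile(r"^(\s*[>|]|.*(writes|wrote):\s*$|In article\b)")


def strip_header(text):
    """Drop everything up to the first blank line (the header block)."""
    _before, sep, after = text.partition("\n\n")
    # If there is no blank line, assume there is no header and keep the text.
    return after if sep else text


def strip_quotes(text):
    """Drop lines that look like quoted text or attribution lines."""
    kept = [line for line in text.split("\n") if not _QUOTE_RE.match(line)]
    return "\n".join(kept)


def strip_signature(text):
    """Drop a trailing block introduced by the conventional '-- ' line."""
    lines = text.rstrip().split("\n")
    # Search from the end so only the last delimiter is treated as a signature.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip() == "--":
            return "\n".join(lines[:i]).rstrip()
    return "\n".join(lines)


# A made-up post with a header block, a quoted reply, and a signature.
post = (
    "From: alice@example.com\n"
    "Subject: Re: orbits\n"
    "NNTP-Posting-Host: node1.example.com\n"
    "\n"
    "In article <123@example.com> bob@example.com writes:\n"
    "> Is the orbit stable?\n"
    "\n"
    "Yes, it is stable for now.\n"
    "\n"
    "-- \n"
    "Alice, Example Astronomy Lab\n"
)

cleaned = strip_signature(strip_quotes(strip_header(post))).strip()
print(cleaned)
```

After all three filters only the body sentence remains, so tokens such as "nntp", "posting", and "host" can no longer act as features. In released versions of scikit-learn the corresponding option is the `remove` argument of `fetch_20newsgroups`, for example `remove=('headers', 'footers', 'quotes')`; the commit message above does not name the argument, so treat that as background rather than as part of this change's description.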
Showing
- doc/datasets/twenty_newsgroups.rst: 118 additions, 6 deletions
- examples/document_classification_20newsgroups.py: 13 additions, 3 deletions
- sklearn/datasets/twenty_newsgroups.py: 85 additions, 5 deletions