Commit abddaa8c authored by Rob Speer, committed by Lars Buitinck

ENH Add filters on newsgroup text

It is easy to overfit on the 20newsgroups dataset by letting
classifiers learn from metadata that commonly appears in newsgroup
texts but would be useless for identifying topics outside this set
of newsgroups in 1993.

For example, many classifiers will tell you that three of the most
informative features are "nntp", "posting", and "host", because the
NNTP-Posting-Host header appears with different frequency in different
groups.

The fetch_20newsgroups function now allows you to ask for any of the
following kinds of text to be removed:

- Newsgroup headers (which contain lots of NNTP metadata that can
  identify the group)
- Signature blocks (which often contain multiple terms that uniquely
  identify the person posting, which in turn identifies the group)
- Quote blocks (which contain people's e-mail addresses and large
  amounts of text from another post in the same newsgroup)
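
As a rough illustration of the three kinds of removal listed above, here
is a minimal sketch in plain Python. The helper names (strip_headers,
strip_quotes, strip_signature) and the heuristics they use are
assumptions made for this example and are not taken from this commit.
In released versions of scikit-learn the option is exposed as the
"remove" argument of fetch_20newsgroups; the sketch avoids calling it so
that it runs without downloading the dataset.

```python
import re

# Hypothetical helpers for illustration only; not the code in this commit.
# A line counts as "quoted" if it starts with '>' or '|', ends with
# "writes:" / "wrote:", or starts with "In article".
_QUOTE_RE = re.compile(r"^\s*(>|\|)|(writes|wrote):\s*$|^In article\b")


def strip_headers(text):
    """Drop the header block, i.e. everything before the first blank line."""
    head, sep, body = text.partition("\n\n")
    return body if sep else text


def strip_quotes(text):
    """Drop lines that look like quotes or quote attributions."""
    kept = [line for line in text.split("\n") if not _QUOTE_RE.search(line)]
    return "\n".join(kept)


def strip_signature(text):
    """Drop the trailing block after the last '--' delimiter line, if any."""
    lines = text.rstrip().split("\n")
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip() == "--":
            return "\n".join(lines[:i]).rstrip()
    return "\n".join(lines)


# A made-up post used only to exercise the helpers.
post = (
    "From: someone@example.com\n"
    "NNTP-Posting-Host: host.example.com\n"
    "Subject: Re: orbit question\n"
    "\n"
    "In article <1@example.com> other@example.com writes:\n"
    "> How high is low earth orbit?\n"
    "\n"
    "Roughly 200 to 2000 km.\n"
    "\n"
    "-- \n"
    "Someone, Example Corp\n"
)

cleaned = strip_signature(strip_quotes(strip_headers(post))).strip()
print(cleaned)
```

Only the author's own sentence is left, so a classifier no longer sees
tokens such as "nntp", "posting", or "host" from this post.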

The 20newsgroups classification example now takes a "--filtered" flag,
which removes all of these. This noticeably decreases the accuracy
of all classifiers, leaving room for better methods to improve on
it.
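
One way such a flag could be wired up is sketched below using argparse.
The parser layout, the function names, and the string names "headers",
"footers", and "quotes" are assumptions for illustration; the actual
example script may parse its options differently.

```python
import argparse


def build_parser():
    """Build a hypothetical command-line parser with a --filtered switch."""
    parser = argparse.ArgumentParser(
        description="20newsgroups classification sketch")
    parser.add_argument(
        "--filtered", action="store_true",
        help="remove headers, signature blocks, and quote blocks")
    return parser


def removal_kinds(args):
    """Map the flag to the kinds of text to strip (names are assumed)."""
    return ("headers", "footers", "quotes") if args.filtered else ()


args = build_parser().parse_args(["--filtered"])
print(removal_kinds(args))
```

Without the flag, removal_kinds returns an empty tuple, so the texts are
left unchanged and earlier results stay reproducible.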
parent f1f0a1a2