datasets

  • Rob Speer authored
    It is easy to overfit on the 20newsgroups dataset by letting
    classifiers learn from metadata that commonly appears in newsgroup
    texts but would be useless for identifying topics outside this
    particular set of newsgroups from 1993.
    
    For example, many classifiers will tell you that three of the most
    informative features are "nntp", "posting", and "host", because the
    NNTP-Posting-Host header appears with different frequency in different
    groups.
    
    The fetch_20newsgroups function now allows you to ask for any of the
    following kinds of text to be removed:
    
    - Newsgroup headers (which contain lots of NNTP metadata that can
      identify the group)
    - Signature blocks (which often contain multiple terms that uniquely
      identify the person posting, which in turn identifies the group)
    - Quote blocks (which contain people's e-mail addresses and large
      amounts of text from another post in the same newsgroup)
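
The three kinds of removal listed above can be sketched roughly as follows. This is a minimal illustration on a made-up post, not scikit-learn's actual implementation: the helper names (`strip_header`, `strip_quotes`, `strip_signature`), the regular expression, and the sample text are all assumptions chosen for the example.

```python
import re

# Hypothetical helpers for illustration only; the heuristics in
# scikit-learn itself may differ.

# A line counts as quoted if it starts with ">" or is an attribution
# line ending in "writes:" / "wrote:".
QUOTE_RE = re.compile(r"^\s*>|\b(?:writes|wrote):\s*$")


def strip_header(text):
    """Drop the header block, i.e. everything before the first blank line."""
    _head, sep, body = text.partition("\n\n")
    return body if sep else text


def strip_quotes(text):
    """Drop lines that look like quotations of another post."""
    kept = [line for line in text.split("\n") if not QUOTE_RE.search(line)]
    return "\n".join(kept)


def strip_signature(text):
    """Drop everything from the last "--" delimiter line onwards."""
    lines = text.rstrip().split("\n")
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "--":
            return "\n".join(lines[:i]).rstrip()
    return text


# Made-up sample post (addresses and hosts are placeholders).
post = (
    "From: alice@example.com\n"
    "Subject: Re: orbit question\n"
    "NNTP-Posting-Host: host.example.com\n"
    "\n"
    "In article <123@example.com> bob@example.com writes:\n"
    "> How high is a geostationary orbit?\n"
    "\n"
    "About 35,786 km above the equator.\n"
    "-- \n"
    "Alice, Example University\n"
)

cleaned = strip_signature(strip_quotes(strip_header(post))).strip()
print(cleaned)
```

In scikit-learn this choice is exposed through the `remove` argument of `fetch_20newsgroups`, for example `remove=('headers', 'footers', 'quotes')`, where `'footers'` corresponds to the signature blocks described above.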
    
    The 20newsgroups classification example takes a "--filtered" flag,
    which removes all three. This noticeably decreases the accuracy of
    every classifier, leaving room for better methods to improve on the
    filtered results.
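
A flag like this could be wired up as in the sketch below. It assumes `argparse` and two hypothetical function names (`build_parser`, `removal_for`); the actual example script may parse its options differently. The resulting tuple is the kind of value one would pass as the `remove` argument of `fetch_20newsgroups`.

```python
import argparse


def build_parser():
    """Build a parser with a single --filtered switch (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="20newsgroups classification sketch"
    )
    parser.add_argument(
        "--filtered",
        action="store_true",
        help="remove headers, signature blocks, and quote blocks",
    )
    return parser


def removal_for(args):
    """Map the parsed flag to the kinds of text that should be removed."""
    return ("headers", "footers", "quotes") if args.filtered else ()


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--filtered"])
print(removal_for(args))
```

Without the flag, `removal_for` returns an empty tuple, so nothing is stripped and the classifiers see the raw posts.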
    abddaa8c