
    The 20 newsgroups text dataset
    ==============================

    The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics, split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.

    The 20 newsgroups dataset is also available through the generic mldata dataset loader introduced earlier. However, mldata provides a version where the data is already vectorized.

    This is not the case for this loader. Instead, it returns a list of the raw text files that can be fed to text feature extractors such as :class:`sklearn.feature_extraction.text.Vectorizer` with custom parameters so as to extract feature vectors.

    Usage
    -----

    The sklearn.datasets.fetch_20newsgroups function is a data fetching / caching function that downloads the data archive from the original 20 newsgroups website, extracts the archive contents in the ~/scikit_learn_data/20news_home folder and calls sklearn.datasets.load_files on either the training or testing set folder, or both of them:

    >>> from sklearn.datasets import fetch_20newsgroups
    >>> newsgroups_train = fetch_20newsgroups(subset='train')
    
    >>> from pprint import pprint
    >>> pprint(list(newsgroups_train.target_names))
    ['alt.atheism',
     'comp.graphics',
     'comp.os.ms-windows.misc',
     'comp.sys.ibm.pc.hardware',
     'comp.sys.mac.hardware',
     'comp.windows.x',
     'misc.forsale',
     'rec.autos',
     'rec.motorcycles',
     'rec.sport.baseball',
     'rec.sport.hockey',
     'sci.crypt',
     'sci.electronics',
     'sci.med',
     'sci.space',
     'soc.religion.christian',
     'talk.politics.guns',
     'talk.politics.mideast',
     'talk.politics.misc',
     'talk.religion.misc']
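
    The same fetcher can load the testing set instead, or both sets at once. A minimal sketch, assuming the standard by-date split served by the original website (subset accepts 'train', 'test' or 'all'):

    >>> newsgroups_test = fetch_20newsgroups(subset='test')
    >>> newsgroups_test.filenames.shape
    (7532,)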

    The real data lies in the filenames and target attributes. The target attribute is the integer index of the category:

    >>> newsgroups_train.filenames.shape
    (11314,)
    >>> newsgroups_train.target.shape
    (11314,)
    >>> newsgroups_train.target[:10]
    array([12,  6,  9,  8,  6,  7,  9,  2, 13, 19])
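
    Since each target value is a position in the target_names list, the two can be combined to recover the category labels; for instance, for the first three posts above:

    >>> [newsgroups_train.target_names[t] for t in newsgroups_train.target[:3]]
    ['sci.electronics', 'misc.forsale', 'rec.sport.baseball']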

    It is possible to load only a sub-selection of the categories by passing the list of category names to the fetch_20newsgroups function:

    >>> cats = ['alt.atheism', 'sci.space']
    >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
    
    >>> list(newsgroups_train.target_names)
    ['alt.atheism', 'sci.space']
    >>> newsgroups_train.filenames.shape
    (1073,)
    >>> newsgroups_train.target.shape
    (1073,)
    >>> newsgroups_train.target[:10]
    array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])

    In order to feed predictive or clustering models with the text data, one first needs to turn the text into vectors of numerical values suitable for statistical analysis. This can be achieved with the utilities of the sklearn.feature_extraction.text module, as demonstrated in the following example that extracts TF-IDF vectors of unigram tokens:

    >>> from sklearn.feature_extraction.text import Vectorizer
    >>> documents = [open(f).read() for f in newsgroups_train.filenames]
    >>> vectorizer = Vectorizer()
    >>> vectors = vectorizer.fit_transform(documents)
    >>> vectors.shape
    (1073, 21108)

    The extracted TF-IDF vectors are very sparse, with an average of 118 non-zero components per sample in a more than 20000-dimensional space (less than 1% non-zero features):

    >>> vectors.nnz / vectors.shape[0]
    118
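
    Such sparse vectors can be fed directly to an estimator. As a minimal sketch, a multinomial Naive Bayes classifier (sklearn.naive_bayes.MultinomialNB, one of several suitable choices here, with an arbitrarily chosen smoothing value) can be fit on these vectors:

    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB(alpha=0.01)
    >>> clf = clf.fit(vectors, newsgroups_train.target)
    >>> predictions = clf.predict(vectors[:10])

    Here predictions holds the predicted category indices for the first ten training documents; unseen documents would first be turned into vectors with vectorizer.transform before calling predict.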

    Examples
    --------

    :ref:`example_grid_search_text_feature_extraction.py`

    :ref:`example_document_classification_20newsgroups.py`