Commit 21e5c2b2, authored 14 years ago by Olivier Grisel
better way to load folder / files dataset
Parent: 1d7be88e
Showing 2 changed files with 22 additions and 11 deletions:
scikits/learn/datasets/base.py (+20, −9)
scikits/learn/datasets/mlcomp.py (+2, −2)
scikits/learn/datasets/base.py (+20, −9)
@@ -22,11 +22,11 @@ class Bunch(dict):
         self.__dict__ = self
 
 
-def load_text_files(container_path, description):
-    """Load text document files with categories as subfolder names
+def load_files(container_path, description=None, categories=None):
+    """Load files with categories as subfolder names
 
-    Individual samples are assumed to be utf-8 encoded text files in a two level
-    folder structure such as the following:
+    Individual samples are assumed to be files stored a two levels folder
+    structure such as the following:
 
         container_folder/
             category_1_folder/
@@ -42,13 +42,16 @@ def load_text_files(container_path, description):
     The folder names are used has supervised signal label names. The indivial
     file names are not important.
 
-    This function does not try to load the text features into a numpy array or
-    scipy sparse matrix, nor does it try to load the text in memory.
+    This function does not try to extract features into a numpy array or
+    scipy sparse matrix, nor does it try to load the files in memory.
 
-    The use text files in a scikit-learn classification or clustering algorithm
-    you will first need to use the `scikits.learn.features.text` module to build
-    a feature extraction transformer that suits your problem.
+    To use utf-8 text files in a scikit-learn classification or clustering
+    algorithm you will first need to use the `scikits.learn.features.text`
+    module to build a feature extraction transformer that suits your
+    problem.
 
     Similar feature extractors should be build for other kind of unstructured
     data input such as images, audio, video, ...
 
     Parameters
     ----------
@@ -60,6 +63,10 @@ def load_text_files(container_path, description):
         a paragraph describing the characteristic of the dataset, its source,
         reference, ...
 
+    categories : None or collection of string or unicode
+        if None (default), load all the categories.
+        if not Non, list of category names to load (other categories ignored)
+
     Returns
     -------
@@ -77,6 +84,10 @@ def load_text_files(container_path, description):
     folders = [f for f in sorted(os.listdir(container_path))
                if os.path.isdir(os.path.join(container_path, f))]
+
+    if categories is not None:
+        folders = [f for f in folders if f in categories]
+
     for label, folder in enumerate(folders):
         target_names[label] = folder
         folder_path = os.path.join(container_path, folder)
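For context, a minimal usage sketch of the renamed loader follows; the folder layout, paths, and description string are hypothetical and not part of this commit:

# Hypothetical two-level layout:
#   /data/reviews/
#       positive/    comment_001.txt  comment_002.txt  ...
#       negative/    comment_101.txt  ...
from scikits.learn.datasets.base import load_files

# Load only the listed subfolders; other category folders are ignored.
subset = load_files('/data/reviews',
                    description='Example sentiment corpus (hypothetical)',
                    categories=['positive', 'negative'])

# Both keyword arguments now default to None, so the simplest call
# loads every category subfolder found under the container path.
everything = load_files('/data/reviews')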
scikits/learn/datasets/mlcomp.py (+2, −2)
@@ -4,7 +4,7 @@
 import os
 import numpy as np
-from scikits.learn.datasets.base import load_text_files
+from scikits.learn.datasets.base import load_files
 from scikits.learn.feature_extraction.text import HashingVectorizer
 from scikits.learn.feature_extraction.text.sparse import HashingVectorizer as \
     SparseCountVectorizer
@@ -13,7 +13,7 @@ from scikits.learn.feature_extraction.text.sparse import HashingVectorizer as \
 def _load_document_classification(dataset_path, metadata, set_=None):
     if set_ is not None:
         dataset_path = os.path.join(dataset_path, set_)
-    return load_text_files(dataset_path, metadata.get('description'))
+    return load_files(dataset_path, metadata.get('description'))
 
 LOADERS = {
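The call site in _load_document_classification keeps passing the description positionally; a short sketch of why that remains compatible with the new keyword defaults (the path and metadata values are made up for illustration):

import os
from scikits.learn.datasets.base import load_files

metadata = {'description': 'hypothetical MLComp dataset description'}
dataset_path = '/tmp/mlcomp/379'    # made-up location
set_ = 'train'

# Mirrors the updated helper: resolve the split subfolder, then delegate.
if set_ is not None:
    dataset_path = os.path.join(dataset_path, set_)
bunch = load_files(dataset_path, metadata.get('description'))
# metadata.get('description') returns None when the key is missing, which
# matches the new default, so existing call sites need no further changes.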