Commit 8489c330, authored 11 years ago by Raul Garreta, committed by Ignacio Rossi 10 years ago
added a new section on model persistence
parent
afcb384e
Showing 3 changed files with 90 additions and 0 deletions:
doc/model_persistence.rst: 85 additions, 0 deletions
doc/tutorial/basic/tutorial.rst: 4 additions, 0 deletions
doc/user_guide.rst: 1 addition, 0 deletions
doc/model_persistence.rst (new file, 0 → 100644, +85 -0)
.. _model_persistence:
=================
Model persistence
=================
After training a scikit-learn model, it is desirable to have a way to persist
the model for future use without having to retrain. The following section gives
you an example of how to persist a model with pickle. We'll also review a few
security and maintainability issues when working with pickle serialization.
Persistence example
-------------------
It is possible to save a scikit-learn model by using Python's built-in
persistence module, namely `pickle <http://docs.python.org/library/pickle.html>`_::

  >>> from sklearn import svm
  >>> from sklearn import datasets
  >>> clf = svm.SVC()
  >>> iris = datasets.load_iris()
  >>> X, y = iris.data, iris.target
  >>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE
  SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

  >>> import pickle
  >>> s = pickle.dumps(clf)
  >>> clf2 = pickle.loads(s)
  >>> clf2.predict(X[0:1])
  array([0])
  >>> y[0]
  0

For scikit-learn models it may be preferable to use joblib's replacement for
pickle (``joblib.dump`` & ``joblib.load``), which is more efficient on objects
that carry large numpy arrays, as fitted estimators often do, but can only
pickle to disk and not to a string::

  >>> from sklearn.externals import joblib
  >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP

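The ``joblib.dump`` call above only writes the model. A full round trip could
look like the following sketch. It assumes the standalone ``joblib`` package is
importable (rather than the copy bundled under ``sklearn.externals``), and it
uses a plain dictionary as a hypothetical stand-in for a fitted estimator; the
file name is arbitrary.

```python
import os
import tempfile

import joblib  # assumption: the standalone joblib package is installed

# Hypothetical stand-in for a fitted estimator; any picklable object works.
model = {"kernel": "rbf", "support_": list(range(5))}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "filename.pkl")
    joblib.dump(model, path)      # writes the object to disk
    restored = joblib.load(path)  # reads it back from the same path

print(restored == model)
```

With a real estimator, ``restored`` would be used exactly like ``clf2`` in the
pickle example.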
Security & maintainability limitations
--------------------------------------
Be aware that pickle has limitations regarding maintainability and security.
From the **maintainability** point of view, problems can arise if you upgrade
scikit-learn and then load a model that was trained with a previous version:
the pickled model may rely on internal code structure that is incompatible with
the newer version, and so fail to load or to work correctly.
The same can happen if you upgrade numpy or scipy.
A good practice is to record the scikit-learn, numpy and scipy versions so that
you know exactly which versions were used to generate the model. You can do
that, for example, by running ``pip freeze`` and saving the output to a text
file stored together with your pickles.
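As an alternative to a full ``pip freeze`` dump, the same idea can be sketched
in Python by writing a small JSON file next to the pickle and checking it
before loading. This is only an illustration: it assumes Python 3.8+ (for
``importlib.metadata``), and ``save_model``, ``version_mismatches`` and the
``.versions.json`` suffix are hypothetical names, not part of scikit-learn.

```python
import json
import os
import pickle
import tempfile
from importlib import metadata  # standard library since Python 3.8

PACKAGES = ("scikit-learn", "numpy", "scipy")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version (None if absent)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def save_model(model, path):
    """Write the pickle and a JSON sidecar file listing library versions."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path + ".versions.json", "w") as f:
        json.dump(installed_versions(), f)


def version_mismatches(path):
    """Compare the sidecar file with the current environment before loading."""
    with open(path + ".versions.json") as f:
        saved = json.load(f)
    current = installed_versions(saved)
    return {k: (saved[k], current[k]) for k in saved if saved[k] != current[k]}


model = {"coef": [0.1, 0.2]}  # hypothetical stand-in for a fitted estimator
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.pkl")
    save_model(model, path)
    mismatches = version_mismatches(path)  # empty in an unchanged environment
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Keeping the versions outside the pickle means they can be read even when the
pickle itself no longer loads.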
Also, save a snapshot of your data to make it possible to retrain the model
if incompatibility issues arise when upgrading the libraries.
Regarding **security**, pickle is implemented as a stack machine that executes
instructions. Unlike serialization formats such as JSON, BSON or YAML, which
are data oriented, pickle is instruction oriented: it serializes objects by
persisting a set of instructions that are then executed at deserialization time
to reconstruct them. As part of that process pickle can call arbitrary
functions, so loading maliciously constructed data can execute arbitrary code.
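To make the "instruction oriented" point concrete, the sketch below builds a
pickle whose loading calls a function chosen by whoever created the data. The
class name ``Payload`` is hypothetical and the call is deliberately harmless
(evaluating ``6 * 7``); a malicious pickle could reference ``os.system`` or any
other importable callable in the same way.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle stores a function call, not plain data."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and replays that call
        # when the bytes are deserialized.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())
result = pickle.loads(data)  # runs eval('6 * 7'); no Payload object is created

print(result)  # 42
```

Note that the loading side never needs the ``Payload`` class: the pickle only
names ``eval`` and its argument.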
Here is the warning from the official pickle documentation:

.. warning::

   The pickle module is not intended to be secure against erroneous or
   maliciously constructed data. Never unpickle data received from an untrusted
   or unauthenticated source.

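When pickles travel between systems you control, one way to address the
"unauthenticated" part of this warning is to sign the bytes and verify the
signature before unpickling, for instance with the standard ``hmac`` module.
The following is a minimal sketch under stated assumptions: ``SECRET_KEY``,
``dumps_signed`` and ``loads_signed`` are hypothetical names, and key
management is out of scope. It only shows that the data came from a key
holder; it does not make pickle safe for untrusted input.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def dumps_signed(obj, key=SECRET_KEY):
    """Return the HMAC-SHA256 signature followed by the pickled bytes."""
    data = pickle.dumps(obj)
    signature = hmac.new(key, data, hashlib.sha256).digest()
    return signature + data


def loads_signed(blob, key=SECRET_KEY):
    """Unpickle only if the signature matches; raise ValueError otherwise."""
    signature, data = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(data)


blob = dumps_signed({"coef": [0.1, 0.2]})
restored = loads_signed(blob)

tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])  # flip one bit of the payload
try:
    loads_signed(tampered)
    rejected = False
except ValueError:
    rejected = True
```

The signature is checked before ``pickle.loads`` is ever called, so tampered
bytes are rejected without being executed.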
If you want to know more about these issues and explore other possible
serialization methods, please refer to this
`talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
\ No newline at end of file
doc/tutorial/basic/tutorial.rst (+4 -0)
@@ -234,3 +234,7 @@ and not to a string::
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
It's important for you to know that pickle has some security and maintainability
issues. Please refer to section :ref:`model_persistence` for more detailed
information about model persistence with scikit-learn.
doc/user_guide.rst (+1 -0)
@@ -22,3 +22,4 @@
Dataset loading utilities <datasets/index.rst>
modules/scaling_strategies.rst
modules/computational_performance.rst
model_persistence.rst