This post is part of our data science series.
At Faraday we use Dataiku for ad hoc exploratory data science work, especially for investigating new predictive techniques before building them into our platform.
Dataiku is awesome and has an incredibly responsive team. One drawback for me, however, has been Dataiku's lack of support for PMML, a standard serialization format for predictive models and their associated apparatus.
Luckily, with a little hacking, you can export a Dataiku model to PMML. And this technique can work anywhere you have a scikit-learn-based model you're trying to export.
We're going to use Dataiku's built-in Python environment, which lives in your DSS data directory (generally /Users/username/Library/DataScienceStudio/dss_home on a Mac). We need to add a couple of libraries first:
$ cd $DSS_DATA_DIR
$ ./bin/pip install sklearn_pandas
$ ./bin/pip install git+https://github.com/jpmml/sklearn2pmml.git
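To confirm the installs landed in the environment you expect, a small check along these lines may help. It is only a sketch: the helper name `missing_modules` is made up for illustration, and it assumes the import names match the package names above (`sklearn_pandas` and `sklearn2pmml`).

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names for the two libraries installed above.
    required = ["sklearn_pandas", "sklearn2pmml"]
    print("Missing modules: %s" % (missing_modules(required) or "none"))
```

Run it with ./bin/python from the DSS data directory so it checks Dataiku's environment rather than your system Python.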
You'll also need a working JDK. If this doesn't work:
$ java -version
java version "1.8.0_121"
Then install a JDK. (On Mac: brew cask install java.)
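If you'd rather run the same JDK check from Python (say, at the top of an export script), something like the following could work. This is a minimal sketch; the function name `java_version_output` is hypothetical and not part of any library.

```python
import subprocess


def java_version_output(java_cmd="java"):
    """Return the text printed by `java -version`, or None if it cannot run."""
    try:
        # `java -version` writes to stderr, so fold stderr into the captured output.
        raw = subprocess.check_output([java_cmd, "-version"],
                                      stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        # OSError: the command was not found; CalledProcessError: non-zero exit.
        return None
    return raw.decode("utf-8", "replace")


if __name__ == "__main__":
    output = java_version_output()
    print(output if output is not None else "No working Java found; install a JDK.")
```

A None result means the export step later on is unlikely to succeed until a JDK is installed.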
Locate your classifier
OK, now let's get our hands on the model you're trying to export. Maybe it's already in memory, but more likely it's pickled on disk. With Dataiku, you'll find your pickled classifier in a path that looks like this:
There it is, clf.pkl. It's helpful to copy this file into your working dir so we don't accidentally disturb it.
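The copy can be done by hand, but here is one way it might look in Python. The helper name `copy_for_safekeeping` and the refusal to overwrite an existing file are illustrative choices, not anything Dataiku prescribes.

```python
import os
import shutil


def copy_for_safekeeping(src_path, dest_dir="."):
    """Copy a pickled model into dest_dir and return the new path.

    The original file is left untouched; an existing copy is never overwritten.
    """
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    if os.path.exists(dest_path):
        raise ValueError("refusing to overwrite %s" % dest_path)
    shutil.copy2(src_path, dest_path)  # copy2 also preserves timestamps
    return dest_path
```

Calling copy_for_safekeeping('/path/to/clf.pkl') would leave a clf.pkl in the current directory for the rest of the walkthrough to work on.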
Export the model to PMML
Now let's start up an interactive Python console — again using Dataiku's built-in environment:
$ cd $DSS_DATA_DIR
$ ./bin/python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
>>>
First let's load up some libraries:
>>> from sklearn.externals import joblib
>>> from sklearn2pmml import PMMLPipeline
>>> from sklearn2pmml import sklearn2pmml
Now we'll unmarshal the model using joblib, a pickle-compatible serialization library:
>>> clf = joblib.load('/path/to/clf.pkl')
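Before going further, it can be worth a quick look at what actually came out of the pickle. The sketch below uses a hypothetical helper, `describe_estimator`, and a stand-in class in place of the real `clf`; with the real object you would pass `clf` directly.

```python
def describe_estimator(obj):
    """Summarize an unpickled object to confirm it looks like a trained model."""
    return {
        "class": type(obj).__name__,
        "module": type(obj).__module__,
        "has_predict": callable(getattr(obj, "predict", None)),
        "has_predict_proba": callable(getattr(obj, "predict_proba", None)),
    }


class _StandInClassifier(object):
    """Hypothetical placeholder for the object returned by joblib.load."""

    def predict(self, rows):
        return [0 for _ in rows]


info = describe_estimator(_StandInClassifier())
```

For a scikit-learn model, the "module" entry should start with "sklearn."; anything else suggests the pickle holds a wrapper or something other than the estimator itself.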
Here's the only tricky part: we have to wrap the trained estimator in a Pipeline-like object that sklearn2pmml understands. (This is likely to get less tricky soon.)
>>> pipeline = PMMLPipeline([
...     ("estimator", clf)
... ])
And finally perform the export:
>>> sklearn2pmml(pipeline, "clf.pmml")
INFO: Parsing PKL..
[snip]
INFO: Marshalled PMML in 714 ms.
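As a quick sanity check on the result, you could parse the exported file and list its top-level elements. This sketch uses only the standard library; `summarize_pmml` is a made-up name, and it assumes nothing about the file beyond a root element called PMML with a version attribute.

```python
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Strip the XML namespace, e.g. '{http://www.dmg.org/PMML-4_3}PMML' -> 'PMML'."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(path):
    """Parse a PMML file and return its version and top-level element names."""
    root = ET.parse(path).getroot()
    if _local_name(root.tag) != "PMML":
        raise ValueError("root element is %r, not PMML" % root.tag)
    return {
        "version": root.get("version"),
        "elements": [_local_name(child.tag) for child in root],
    }
```

Calling summarize_pmml("clf.pmml") should report elements such as Header and DataDictionary alongside the model element itself; a parse error or a different root element means the export did not produce valid PMML.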