Use Scikit-Learn and Runflow
If you're not familiar with scikit-learn and Runflow:
- Scikit-learn is a simple and efficient tool for predictive data analysis.
- Runflow is a tool to define and run workflows.
Used together, they help keep your machine learning code organized.
If you simply follow the code snippets in the scikit-learn documentation,
you will quickly run into some issues:
- With more complexity added, the code doesn't scale well.
- The code is hard to maintain and read.
- You need to deal with where to load and save the data.
- You need to track the change of models and parameters over time.
- A code change in the middle of the script may break the code that follows, and you spend a lot of time troubleshooting.
Let's see how Runflow addresses these issues.
Step 1, split your code into small classes and functions so they are easy to reuse. More importantly, any error is contained within a narrow scope.
# File: examples/sklearn_refactored_script.py
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
import pandas as pd


class ExtractTrainingSet:

    def run(self):
        # load_boston() returns a Bunch with data, target and feature_names.
        boston = load_boston()
        return dict(
            x=pd.DataFrame(boston.data, columns=boston.feature_names),
            y=pd.DataFrame(boston.target, columns=['target'])
        )


class TrainModel:

    # Maps the model name used in the flow file to an estimator class.
    MODELS = {
        'ols': LinearRegression,
        'gbm': GradientBoostingRegressor,
    }

    def __init__(self, model, x, y):
        if model not in self.MODELS:
            raise ValueError(f'invalid model: {model}')
        self.model = model
        self.x = x
        self.y = y

    def run(self):
        model = self.MODELS[self.model]()
        model.fit(self.x, self.y)
        # The score is computed on the training data itself.
        score = model.score(self.x, self.y)
        return {'score': score}
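The task classes above share a simple contract: arguments arrive through `__init__`, the work happens in `run()`, and the result is a dict whose keys later tasks can read. The following is a minimal sketch of that contract using only the standard library, so it can be tried without scikit-learn. `MeanModel` and `TrainToyModel` are hypothetical names introduced for illustration; they are not part of Runflow or scikit-learn.

```python
from statistics import mean


class MeanModel:
    """Hypothetical stand-in for an estimator: always predicts the mean of y."""

    def fit(self, x, y):
        self.prediction = mean(y)
        return self

    def predict(self, x):
        return [self.prediction for _ in x]


class TrainToyModel:
    """Same shape as TrainModel above: validate early, do the work in run()."""

    MODELS = {'mean': MeanModel}

    def __init__(self, model, x, y):
        # Fail before any training starts if the model name is unknown.
        if model not in self.MODELS:
            raise ValueError(f'invalid model: {model}')
        self.model = model
        self.x = x
        self.y = y

    def run(self):
        model = self.MODELS[self.model]().fit(self.x, self.y)
        # Return a dict so each key can be consumed by a later task.
        return {'predictions': model.predict(self.x)}


result = TrainToyModel('mean', x=[[1], [2], [3]], y=[2.0, 4.0, 6.0]).run()
print(result)
```

Because validation happens in the constructor, a typo in the model name surfaces as a `ValueError` at the task boundary instead of somewhere deep in a long script.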
Step 2, define a data flow using Runflow.
# File: examples/ml_example.hcl
flow "machine_learning_example" {

  import {
    tasks = {
      extract = "examples.sklearn_refactored_script:ExtractTrainingSet"
      train_model = "examples.sklearn_refactored_script:TrainModel"
    }
  }

  task "extract" "setup" {
  }

  task "train_model" "model1" {
    model = "ols"
    x = task.extract.setup.x
    y = task.extract.setup.y
  }

  task "train_model" "model2" {
    model = "gbm"
    x = task.extract.setup.x
    y = task.extract.setup.y
  }

  task "file_write" "output" {
    filename = "/dev/stdout"
    content = tojson({
      scores = {
        ols = task.train_model.model1.score
        gbm = task.train_model.model2.score
      }
    }, { indent = 2 }...)
  }
}
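The references between tasks in the flow above (for example `task.extract.setup.x`) imply a dependency graph: both training tasks need the extract task, and the output task needs both training tasks. The sketch below illustrates how an execution order can be derived from such a graph with the standard library's `graphlib`. It is only an illustration of the idea and makes no claim about how Runflow schedules tasks internally; the `dependencies` dict is written by hand for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it reads values from in the flow above.
dependencies = {
    'task.extract.setup': set(),
    'task.train_model.model1': {'task.extract.setup'},
    'task.train_model.model2': {'task.extract.setup'},
    'task.file_write.output': {
        'task.train_model.model1',
        'task.train_model.model2',
    },
}

# static_order() yields every task only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    print(name)
```

The two training tasks do not depend on each other, so either may come first; only their position between the extract task and the output task is fixed.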
Run:
$ python3 -mvenv venv
$ source venv/bin/activate
$ pip install runflow "scikit-learn<1.2" pandas
$ python3 -mrunflow run examples/ml_example.hcl
[2021-07-06 23:15:19,999] "task.extract.setup" is started.
[2021-07-06 23:15:20,006] "task.extract.setup" is successful.
[2021-07-06 23:15:20,006] "task.train_model.model2" is started.
[2021-07-06 23:15:20,144] "task.train_model.model2" is successful.
[2021-07-06 23:15:20,144] "task.train_model.model1" is started.
[2021-07-06 23:15:20,151] "task.train_model.model1" is successful.
[2021-07-06 23:15:20,152] "task.file_write.output" is started.
{
"scores": {
"ols": 0.7406426641094095,
"gbm": 0.9761405838418584
}
}
[2021-07-06 23:15:20,153] "task.file_write.output" is successful.
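The scores in the output come from `model.score(x, y)`, which for scikit-learn regressors returns the coefficient of determination (R²). In this example it is computed on the same data the model was trained on, so it describes how well each model fits the training set rather than how it performs on unseen data. As a rough sketch of what the number means, the hypothetical helper `r_squared` below computes R² in plain Python; it is an illustration, not scikit-learn's implementation.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

# Predicting every value exactly leaves no residual error.
perfect = r_squared(y_true, y_true)

# Always predicting the mean of y_true explains none of the variance.
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])

print(perfect, baseline)
```

Read this way, a score close to 1 means the predictions track the training targets closely, and a score close to 0 means the model does no better than predicting the mean.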
Writing machine learning code in a spaghetti style creates the problems listed above.
Because the steps have complex dependencies on each other, it's better to define them as a flow of tasks.
Runflow is one competitive way to do that.