Use Scikit-Learn and Runflow

If you're not familiar with scikit-learn and Runflow, here is a short introduction to each.

Scikit-learn is a simple and efficient tool for predictive data analysis.

Runflow is a tool to define and run workflows.

By combining the two, you can keep your machine learning code better organized.

Why Is Runflow Needed?

If you simply follow the code snippets shown in the scikit-learn documentation,
you will quickly run into some issues.

The code doesn't scale well as complexity grows.

The code is hard to read and maintain.

You need to decide where the data is loaded from and saved to.

You need to track how models and parameters change over time.

A change in the middle of the script may break the code that follows it, and you end up spending a lot of time troubleshooting.

How to Improve Your Machine Learning Code

Let's see how Runflow addresses these issues.

Step 1, split your code into small classes and functions so they are easy to reuse. More importantly, an error stays contained within the scope of a single task.

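# File: examples/sklearn_refactored_script.py
# Note: load_boston was removed in scikit-learn 1.2, so this example
# requires an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
import pandas as pd

class ExtractTrainingSet:
    """Load the Boston housing data as feature and target DataFrames."""

    def run(self):
        # load_boston() returns a Bunch with data, target and feature_names.
        dataset = load_boston()
        return dict(
            x=pd.DataFrame(dataset.data, columns=dataset.feature_names),
            y=pd.DataFrame(dataset.target, columns=['target'])
        )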

class TrainModel:

    MODELS = {
        'ols': LinearRegression,
        'gbm': GradientBoostingRegressor,
    }

    def __init__(self, model, x, y):
        if model not in self.MODELS:
            raise ValueError(f'invalid model: {model}')
        self.model = model
        self.x = x
        self.y = y

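    def run(self):
        model = self.MODELS[self.model]()
        model.fit(self.x, self.y)
        # score() is evaluated on the same data used for fitting,
        # so it reports training fit (R^2), not held-out performance.
        score = model.score(self.x, self.y)
        return {'score': score}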
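
Because each task is a plain Python class, it can be exercised directly, without Runflow. The sketch below mirrors the constructor check in TrainModel to show how a bad model name is caught inside one task. MeanRegressor, TinyTrainModel and try_task are hypothetical stand-ins (not part of scikit-learn or Runflow), used so the snippet runs with only the standard library.

```python
class MeanRegressor:
    """Hypothetical stand-in for an estimator: always predicts the mean of y."""

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, x):
        return [self.mean_ for _ in x]


class TinyTrainModel:
    """Same shape as TrainModel above: validate in __init__, work in run()."""

    MODELS = {'mean': MeanRegressor}

    def __init__(self, model, x, y):
        if model not in self.MODELS:
            raise ValueError(f'invalid model: {model}')
        self.model = model
        self.x = x
        self.y = y

    def run(self):
        model = self.MODELS[self.model]().fit(self.x, self.y)
        return {'predictions': model.predict(self.x)}


def try_task(model_name):
    """Build and run one task, reporting a bad model name instead of crashing."""
    try:
        task = TinyTrainModel(model_name, x=[[1], [2], [3]], y=[2.0, 4.0, 6.0])
        return task.run()
    except ValueError as exc:
        return {'error': str(exc)}


ok = try_task('mean')
bad = try_task('svm')
print(ok)
print(bad)
```

The invalid name fails at construction time, before any fitting starts, so the failure points at one small class and not at a line somewhere in a long script.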

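Step 2, define a data flow using Runflow. In the flow below, the arguments in each task block match the parameters of the class constructor, and the keys of the dict returned by run() are read by other tasks as task.<type>.<name>.<key>, for example task.extract.setup.x.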

# File: examples/ml_example.hcl

flow "machine_learning_example" {

  import {
    tasks = {
      extract = "examples.sklearn_refactored_script:ExtractTrainingSet"
      train_model = "examples.sklearn_refactored_script:TrainModel"
    }
  }

  task "extract" "setup" {
  }

  task "train_model" "model1" {
    model = "ols"
    x = task.extract.setup.x
    y = task.extract.setup.y
  }

  task "train_model" "model2" {
    model = "gbm"
    x = task.extract.setup.x
    y = task.extract.setup.y
  }

  task "file_write" "output" {
    filename = "/dev/stdout"
    content = tojson({
      scores = {
        ols = task.train_model.model1.score
        gbm = task.train_model.model2.score
      }
    }, { indent = 2 }...)
  }

}
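
The references between tasks (for example task.extract.setup.x) give the flow its shape: a task can only run once the tasks it reads from have finished. How Runflow schedules tasks internally is not covered here. The sketch below only illustrates the idea with the standard library graphlib module (Python 3.9+), using a hand-written dependency table that is assumed to mirror the flow above.

```python
from graphlib import TopologicalSorter

# Hand-written for illustration: each task maps to the tasks it reads from.
dependencies = {
    'task.extract.setup': set(),
    'task.train_model.model1': {'task.extract.setup'},
    'task.train_model.model2': {'task.extract.setup'},
    'task.file_write.output': {
        'task.train_model.model1',
        'task.train_model.model2',
    },
}

# static_order() yields every task after all of the tasks it depends on.
order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    print(name)
```

The two training tasks do not depend on each other, so either may come first. Only the extract step must precede them and the output step must follow them.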

Run:

$ python3 -mvenv venv
$ source venv/bin/activate
$ pip install runflow pandas "scikit-learn<1.2"
$ python3 -mrunflow run examples/ml_example.hcl
[2021-07-06 23:15:19,999] "task.extract.setup" is started.
[2021-07-06 23:15:20,006] "task.extract.setup" is successful.
[2021-07-06 23:15:20,006] "task.train_model.model2" is started.
[2021-07-06 23:15:20,144] "task.train_model.model2" is successful.
[2021-07-06 23:15:20,144] "task.train_model.model1" is started.
[2021-07-06 23:15:20,151] "task.train_model.model1" is successful.
[2021-07-06 23:15:20,152] "task.file_write.output" is started.
{
  "scores": {
    "ols": 0.7406426641094095,
    "gbm": 0.9761405838418584
  }
}
[2021-07-06 23:15:20,153] "task.file_write.output" is successful.
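
For scikit-learn regressors, score returns the coefficient of determination R^2, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. As a reminder of what the two numbers in the output mean, here is a minimal pure-Python sketch of that formula. r2_score_plain is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def r2_score_plain(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_plain(y_true, y_true)
baseline = r2_score_plain(y_true, [2.5, 2.5, 2.5, 2.5])
print(perfect, baseline)
```

Since the flow scores each model on the data it was fitted to, a higher number here shows a closer fit to the training set and says nothing yet about performance on unseen data.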

Conclusion

Writing machine learning code as one long, tangled script creates maintenance problems.
Because the steps depend on each other in complex ways, it's better to define them as a flow of tasks.
Runflow is one tool that lets you do that.