Use Scikit-Learn and Runflow

If you're not familiar with Scikit-learn and Runflow,
  • Scikit-learn is a simple and efficient tool for predictive data analysis.
  • Runflow is a tool to define and run workflows.
  • Used together, they help keep your machine learning code better organized.

Why Is Runflow Needed?

If you simply follow the code snippets shown in the scikit-learn documentation, you will quickly run into some issues:
  • As complexity grows, the code doesn't scale well.
  • The code is hard to maintain and read.
  • You need to deal with where to load and save the data.
  • You need to track changes to models and parameters over time.
  • A code change in the middle of the script may break the code that follows it, and you can spend a lot of time troubleshooting.

How to Improve Your Machine Learning Code?

Let's see how Runflow addresses these issues.

Step 1: split your code into small classes and functions so that they are easy to reuse. More importantly, an error stays contained within the scope of a single class or function.

    # File: examples/sklearn_refactored_script.py
    from sklearn.datasets import load_boston
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression
    import pandas as pd
    
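    class ExtractTrainingSet:
        """Load the Boston housing data as feature and target DataFrames."""

        def run(self):
            # Note: load_boston was removed in scikit-learn 1.2, so this
            # example needs an older scikit-learn release.
            df = load_boston()
            return dict(
                x=pd.DataFrame(df.data, columns=df.feature_names),
                y=pd.DataFrame(df.target, columns=['target'])
            )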
    
    class TrainModel:
    
        MODELS = {
            'ols': LinearRegression,
            'gbm': GradientBoostingRegressor,
        }
    
        def __init__(self, model, x, y):
            if model not in self.MODELS:
                raise ValueError(f'invalid model: {model}')
            self.model = model
            self.x = x
            self.y = y
    
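        def run(self):
            # Instantiate the selected estimator with its default parameters.
            model = self.MODELS[self.model]()
            model.fit(self.x, self.y)
            # score() returns R^2; here it is measured on the training data.
            score = model.score(self.x, self.y)
            return {'score': score}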
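
Because each chunk is a plain class with a `run()` method, it can be exercised on its own in Python before any flow is defined. The sketch below illustrates that idea with stand-in classes (`ExtractToyData` and `TrainToyModel`, along with the helpers `fit_mean` and `fit_line`, are hypothetical names that are not part of the example above). They mirror the same contract, inputs in `__init__` and a dict out of `run()`, but use only the standard library and a five-point toy dataset instead of scikit-learn.

```python
from statistics import fmean


class ExtractToyData:
    """Hypothetical stand-in for ExtractTrainingSet: returns a dict of inputs."""

    def run(self):
        x = [1.0, 2.0, 3.0, 4.0, 5.0]
        y = [2.0, 4.0, 6.0, 8.0, 10.0]  # toy data, exactly y = 2x
        return dict(x=x, y=y)


def fit_mean(x, y):
    """Baseline model: always predict the mean of y."""
    mean_y = fmean(y)
    return lambda value: mean_y


def fit_line(x, y):
    """One-variable least squares: y = slope * x + intercept."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: slope * value + intercept


class TrainToyModel:
    """Hypothetical stand-in for TrainModel with the same constructor shape."""

    MODELS = {'mean': fit_mean, 'line': fit_line}

    def __init__(self, model, x, y):
        # Invalid input is rejected here, before any training starts.
        if model not in self.MODELS:
            raise ValueError(f'invalid model: {model}')
        self.model = model
        self.x = x
        self.y = y

    def run(self):
        predict = self.MODELS[self.model](self.x, self.y)
        mean_y = fmean(self.y)
        ss_res = sum((b - predict(a)) ** 2 for a, b in zip(self.x, self.y))
        ss_tot = sum((b - mean_y) ** 2 for b in self.y)
        # R^2 on the training data, analogous to model.score(x, y).
        return {'score': 1 - ss_res / ss_tot}


if __name__ == '__main__':
    data = ExtractToyData().run()
    for name in TrainToyModel.MODELS:
        # The output of one task is passed as keyword inputs to the next.
        print(name, TrainToyModel(name, **data).run())
```

Passing an unknown model name, for example `TrainToyModel('svm', **data)`, raises `ValueError` in the constructor, so the failure is reported at the task that caused it rather than somewhere later in a long script.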

Step 2: define a data flow using Runflow.

    # File: examples/ml_example.hcl
    
    flow "machine_learning_example" {
    
      import {
        tasks = {
          extract = "examples.sklearn_refactored_script:ExtractTrainingSet"
          train_model = "examples.sklearn_refactored_script:TrainModel"
        }
      }
    
      task "extract" "setup" {
      }
    
      task "train_model" "model1" {
        model = "ols"
        x = task.extract.setup.x
        y = task.extract.setup.y
      }
    
      task "train_model" "model2" {
        model = "gbm"
        x = task.extract.setup.x
        y = task.extract.setup.y
      }
    
      task "file_write" "output" {
        filename = "/dev/stdout"
        content = tojson({
          scores = {
            ols = task.train_model.model1.score
            gbm = task.train_model.model2.score
          }
        }, { indent = 2 }...)
      }
    
    }
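
In the flow above, a task that reads `task.extract.setup.x` cannot start until `task.extract.setup` has finished, so the references between tasks imply a dependency graph. The sketch below shows how such a graph can be turned into a valid execution order with Python's standard `graphlib` module. It illustrates the general idea only and is not taken from Runflow's implementation; the `dependencies` dict is hand-written from the flow file for illustration.

```python
from graphlib import TopologicalSorter

# Each key is a task; the value is the set of tasks whose outputs it reads.
# This mapping is written by hand from the references in ml_example.hcl.
dependencies = {
    'task.extract.setup': set(),
    'task.train_model.model1': {'task.extract.setup'},
    'task.train_model.model2': {'task.extract.setup'},
    'task.file_write.output': {
        'task.train_model.model1',
        'task.train_model.model2',
    },
}

# static_order() yields every task after all of the tasks it depends on.
order = list(TopologicalSorter(dependencies).static_order())

if __name__ == '__main__':
    for task_name in order:
        print(task_name)
```

The extract task always comes first and the file-writing task always comes last. The two training tasks do not depend on each other, so either one may run before the other.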

Run:

    $ python3 -m venv venv
    $ source venv/bin/activate
    $ pip install runflow pandas "scikit-learn<1.2"
    $ python3 -m runflow run examples/ml_example.hcl
    [2021-07-06 23:15:19,999] "task.extract.setup" is started.
    [2021-07-06 23:15:20,006] "task.extract.setup" is successful.
    [2021-07-06 23:15:20,006] "task.train_model.model2" is started.
    [2021-07-06 23:15:20,144] "task.train_model.model2" is successful.
    [2021-07-06 23:15:20,144] "task.train_model.model1" is started.
    [2021-07-06 23:15:20,151] "task.train_model.model1" is successful.
    [2021-07-06 23:15:20,152] "task.file_write.output" is started.
    {
      "scores": {
        "ols": 0.7406426641094095,
        "gbm": 0.9761405838418584
      }
    }
    [2021-07-06 23:15:20,153] "task.file_write.output" is successful.

Conclusion

Writing machine learning code as one long spaghetti script tends to create problems.
Because the steps have complex dependencies, it is better to define them as a flow of tasks.
Runflow is one competitive option for doing so.
