MLOps journey with AWS - part 1 (helicopter view)

in this series of posts, we are going to start our journey into building the foundations of an MLOps culture, and we will see how AWS services can help us productize our ML projects.
since this is the first article, I want to set the foundations and also give a full view of the tools and tech we need on our journey to building end-to-end CI/CD/CT pipelines for ML projects.

first, let's take a closer look at a general ML/data project life cycle

as you can see, there are 4 main phases:

phase-1: gathering, ingestion, and extraction of data

phase-2: data exploring, understanding, cleaning, transforming, and pre-processing (I put these in one phase because the more you understand the data, the better you can pre-process and feature-engineer it)

phase-3: modeling ( training and validation )

phase-4: serving
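to make the hand-off between these phases concrete, here is a toy sketch of the four phases as plain Python functions feeding each other (the data, features, and "model" are made up for illustration, not a real framework):

```python
def ingest():
    # phase 1: gather raw records (hard-coded here for illustration)
    return [{"age": 34, "clicks": 12}, {"age": 51, "clicks": 3}]

def preprocess(raw):
    # phase 2: clean and feature-engineer (here, a toy clicks-per-age ratio)
    return [{**r, "click_rate": r["clicks"] / r["age"]} for r in raw]

def train(features):
    # phase 3: "train" a trivial model -- the mean click rate as a threshold
    threshold = sum(f["click_rate"] for f in features) / len(features)
    return {"threshold": threshold}

def serve(model, record):
    # phase 4: score a new record against the trained model
    return record["click_rate"] > model["threshold"]

model = train(preprocess(ingest()))
print(serve(model, {"click_rate": 0.5}))  # True
```

each phase only consumes the previous phase's output, which is exactly the chain of manual hand-offs we will talk about below.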

based on the above, a simple ML project story:
the team (Ali is a data scientist, Jack is an ML researcher, Mamon is an ML engineer)
Ali gathered some data, explored it, did cleaning, preprocessing, and feature engineering on it, and delivered it to Jack.
Jack split the data, did modeling, training, and benchmarking, and now delivers a trained model with its artifacts to Mamon.
Mamon validates the model, builds the serving service, and pushes the model to production.

now, what is wrong with the above lifecycle?

the above is not a production-ready solution ?!!
here are some notes on why it is not production-ready:

1- it is inefficient and not scalable
2- it does not take data drift into account, and there is no monitoring or clear visibility on the process from data to modeling
3- there is no integration with the software team that will consume and use the ML service
4- there are a lot of manual deliveries and cuts between the processes

but wait a second, what do you mean by drifting??!!!

In practice, models often fail to adapt to changes in the environment or in the data, or sometimes for other reasons we want to re-train, re-finetune, or re-deploy models to adapt to changes. so we need a monitoring phase, a re-training pipeline, and a deployment pipeline with strategies, so we can respond quickly to any kind of data or model drift.

in short, I will give the 2 main types, but there are others:

  • Data drift can happen when the distribution of production data changes from the data the model was trained on; depending on that change or those new patterns, model accuracy and performance will decrease

  • another type of drift is concept drift, which is when the relationship between the features and the label y changes
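to make data drift measurable, here is a minimal Population Stability Index (PSI) check in plain Python. PSI is one common way to compare a baseline distribution against live data; the 10 bins and the 0.2 alert level are common rules of thumb, not an AWS API (SageMaker Model Monitor computes similar statistics for you):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / bins or 1.0
    def fractions(values):
        counts = [0] * bins
        for v in values:
            i = min(max(int((v - lo) / step), 0), bins - 1)
            counts[i] += 1
        # small epsilon so empty bins don't blow up the log
        return [max(c / len(values), 1e-4) for c in counts]
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # training-time distribution
shifted  = [0.5 + i / 200 for i in range(100)]    # production data drifted upward
print(psi(baseline, baseline) < 0.1)  # True: same data, no drift
print(psi(baseline, shifted) > 0.2)   # True: alert-level drift
```

a check like this, run on a schedule over production inputs, is what would later fire the re-training triggers we discuss below.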

this topic is very dependent on your use case: the frequency depends on the nature of the application, and the trigger depends on whether the quality target can be measured without a human in the loop or needs one. don't worry, I will write a dedicated article covering this topic in more detail, including how to handle it using AWS

in a few words: either new trends and new patterns have appeared that we need our model to catch up with, or the whole hypothesis has changed and we need to re-architect the model or models

so how can MLOps help the above small team if they want to scale and increase productivity while using best practices?

What is MLops :

it is an engineering culture and practice that aims to unify ML system development (ML Dev) and ML system operation (Ops). by this, we want to increase visibility, reduce manual steps, increase team collaboration, and increase the speed of both rolling out changes and responding to new ones.
This means automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management.

why MLOps and not DevOps (the difference between MLOps & DevOps)

the nature of the projects and teams involved in each of the above (MLOps, DevOps) creates a need to extend the ideas, vision, and tools of DevOps to fit ML/data projects.

1- the team in ML projects usually includes data scientists or ML researchers; these members might not be experienced software engineers who can build production-class services.

2- ML is experimental. we need to track experiments and maintain reproducibility while maximizing code reusability.

3- Testing an ML system needs data validation, trained model evaluation, and model validation.

4- ML models can suffer reduced performance not only because of problems in the code, but also due to drifting (as mentioned above).

now, what kind of extensions does CI/CD need to fit our MLOps needs?

  • Continuous Integration (CI): extend testing and validating of code and components by adding testing and validating of data, data schemas, and models.

  • Continuous Delivery (CD): extend service packaging by adding an ML training pipeline that should automatically deploy another service, the model prediction service.

  • Continuous Training (CT): automatically retrain ML models for re-deployment (a property unique to ML systems). keep in mind that re-training triggers happen for reasons and are not 100% automatic; in many cases they need a human in the loop to decide on such triggers, so it is a case-based approach, but you must have a training pipeline with triggers ready

  • Continuous Monitoring (CM): monitor production data and model performance in terms of business metrics.
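the CT idea above can be sketched in a few lines: a monitored metric crosses a threshold and a retraining step fires. everything here is illustrative (the 0.9 accuracy threshold and the `retrain()` function are made up; in a real setup this would kick off a SageMaker Pipelines execution or a Lambda, often after human approval):

```python
ACCURACY_THRESHOLD = 0.9  # illustrative quality target

def retrain(run_log):
    # hypothetical retraining step -- in production this would start the
    # training pipeline; here we just record that it was triggered
    run_log.append("retrain-triggered")
    return run_log

def continuous_training_check(live_accuracy, run_log):
    """Fire the retraining pipeline when monitored accuracy decays."""
    if live_accuracy < ACCURACY_THRESHOLD:
        return retrain(run_log)
    return run_log

log = continuous_training_check(0.95, [])   # healthy model -> no trigger
log = continuous_training_check(0.82, log)  # decayed model -> retrain
print(log)  # ['retrain-triggered']
```

the important part is the separation: monitoring produces the metric, the trigger decides, and the (pre-built) training pipeline does the actual work.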

some goals we want to, or try to, achieve when applying MLOps:

1- team collaboration and easy tracking and reproducing of code, data, and models
2- easy sharing of preprocessed data across teams and projects, or even moving between data channels like test and train
3- good visibility and monitoring all the way (from data to serving, to even feedback from the end client of the final software)
4- easy decision-making or triggering for any part, such as triggering re-training or processing

tools and tech to achieve some or all of the above goals

1- feature store: this will take care of automating some of Ali's work (preprocessing and feature engineering).
it is a single source of truth to store, retrieve, remove, track, share, discover, and control access to features.
you can use AWS SageMaker Feature Store for that
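to make the "single source of truth" contract concrete, here is a toy in-memory feature store. this only illustrates the store/retrieve/remove idea; SageMaker Feature Store adds online/offline stores, access control, and history on top of it (the group and record names are made up):

```python
class ToyFeatureStore:
    """Toy single-source-of-truth for features, keyed by entity id."""
    def __init__(self):
        self._groups = {}  # feature group name -> {record_id: feature dict}

    def ingest(self, group, record_id, features):
        # store (or overwrite) the features for one entity
        self._groups.setdefault(group, {})[record_id] = dict(features)

    def get_record(self, group, record_id):
        # retrieve the latest features for one entity
        return self._groups[group][record_id]

    def delete_record(self, group, record_id):
        del self._groups[group][record_id]

store = ToyFeatureStore()
store.ingest("customers", "c-42", {"age": 34, "click_rate": 0.35})
print(store.get_record("customers", "c-42")["click_rate"])  # 0.35
```

because Ali writes features here once, Jack can read the exact same values at training time and Mamon at serving time, instead of each re-deriving them.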

2- experiment tracking and model registry: this will help Jack in his experiments and will help Mamon find the right models while pushing to production; here we can use AWS SageMaker Model Registry
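a toy sketch of what a model registry gives the team: versioned artifacts plus an approval gate between Jack and Mamon. SageMaker Model Registry adds model package groups, approval status, and deployment metadata on top of this idea (the model name and S3 path below are made up for illustration):

```python
class ToyModelRegistry:
    """Toy registry: versioned model artifacts plus an approval flag."""
    def __init__(self):
        self._models = {}  # model name -> list of {"artifact", "approved"}

    def register(self, name, artifact):
        versions = self._models.setdefault(name, [])
        versions.append({"artifact": artifact, "approved": False})
        return len(versions)  # version numbers start at 1

    def approve(self, name, version):
        # e.g. Mamon approves a version after validating it
        self._models[name][version - 1]["approved"] = True

    def latest_approved(self, name):
        # the serving pipeline only ever deploys approved versions
        for entry in reversed(self._models.get(name, [])):
            if entry["approved"]:
                return entry["artifact"]
        return None

registry = ToyModelRegistry()
v1 = registry.register("churn-model", "s3://my-bucket/model-v1.tar.gz")
registry.approve("churn-model", v1)
print(registry.latest_approved("churn-model"))  # s3://my-bucket/model-v1.tar.gz
```

the approval flag is the hand-off point: Jack registers candidates, Mamon approves, and deployment reads only `latest_approved`.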

3- training pipeline: no need for Jack to re-execute the training phase by hand if model decay happens; a trigger can start re-training using the training pipeline defined by Jack. you can use AWS SageMaker Pipelines, and also AWS Lambda, TFX, TorchX, or Kubeflow

4- serving pipeline: based on the production strategy, this pipeline can help Mamon with everything from checking which model to validate, to re-packaging and re-deploying the serving services. again, AWS SageMaker Pipelines, AWS Lambda, TFX, TorchX, or Kubeflow

5- data and model monitoring for drift and decay, with triggers: here we can use AWS SageMaker Pipelines with Model Monitor

6- when we talk about pipelines, we need orchestration and automation; AWS SageMaker Pipelines will manage that for you behind the scenes

how we can achieve all of the above in AWS: a quick video
