What does it mean to scale your Python-powered ML pipeline?
Scaling, in general, means starting out with a small solution and growing it into a bigger one.
Different people mean different things when they talk about scaling. Do you want to scale a mountain, a web app, or an ML pipeline?
Even when talking about machine learning, scaling can mean different things:
- Processing more data
- Using a more complex model
- Running more often
Scaling requires resources and comes at a cost. The cost can take the form of:
- Developer time for code refactoring and performance optimization
- Performance-optimized hardware for accelerated computing
- More machines for distributed processing
- Longer wait times when processing more on the same infrastructure
There are two important things to consider before you invest in scaling:
What are your goals? You must decide beforehand what is good enough. Over-engineering is a trap all engineers are familiar with. Don't over-engineer!
Measure your code. To treat a patient, a doctor first has to determine what the illness is. Likewise, you need to measure your code for bottlenecks before deciding on a scaling strategy.
There are many tools you can use to profile your Python code. Scalene is one of the most complete profilers available at the moment.
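For example, here is a minimal sketch of profiling a toy script with Scalene; the script and its hot loop are made up for illustration:

```python
# hot_loop.py -- toy script to profile (hypothetical example)
import numpy as np

def slow_sum(n):
    # Pure-Python loop: a CPU bottleneck Scalene should flag line by line
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def fast_sum(n):
    # Vectorized equivalent that spends its time in compiled NumPy code
    return float(np.arange(n).sum() * 0.5)

if __name__ == "__main__":
    print(slow_sum(10_000_000))
    print(fast_sum(10_000_000))
```

Running `scalene hot_loop.py` produces a per-line report of CPU time, memory, and copy volume, which tells you where optimization effort will actually pay off.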
1. Increase code efficiency: Python is designed for ease of use and easy extension, not for performance. As a developer, the onus is on you to do more work up front so that the application executes less interpreted code. Whenever possible, use vectorized library functions instead of loops.
Python is successful in data science largely because of the pre-compiled code offered by the data-oriented libraries of the PyData stack, such as pandas and NumPy.
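As a quick sketch of the difference (the array and the arithmetic are made up for illustration):

```python
import numpy as np

data = np.random.rand(1_000_000)

# Interpreted loop: one trip through the bytecode interpreter per element
total = 0.0
for x in data:
    total += x * 2.0

# Vectorized: the same work in a single call into pre-compiled C
total_vec = (data * 2.0).sum()
```

The loop crosses the interpreter a million times; the vectorized version makes one library call, which is exactly the "execute less code" trade described above.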
2. Add parallelism: You can achieve parallelism on a single machine or across multiple machines (see the single-machine sketch after this list). In principle, you want to stay on a single machine as much as possible, because distributed computing increases complexity through the following challenges:
- network transfer
- task management
- cluster management
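Staying on a single machine first, here is a minimal sketch of parallelism with the standard library's multiprocessing pool; score_chunk and the chunking are hypothetical stand-ins for real pipeline work:

```python
from multiprocessing import Pool

def score_chunk(chunk):
    # Hypothetical CPU-bound work on one slice of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    # Split the work into four independent chunks
    chunks = [range(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
    # One worker process per chunk sidesteps the GIL for CPU-bound work
    with Pool(processes=4) as pool:
        results = pool.map(score_chunk, chunks)
    print(sum(results))
```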
2a. Hardware accelerators: You can leverage GPUs and TPUs for your ML workflows. Libraries such as CuPy and RAPIDS leverage GPUs. There are also non-GPU accelerators, such as the TPUs offered by Google and the Apple Neural Engine.
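To illustrate how small the code change can be, CuPy mirrors much of the NumPy API. This sketch assumes a CUDA-capable GPU and an installed cupy package:

```python
import numpy as np
import cupy as cp

x_cpu = np.random.rand(1_000_000).astype(np.float32)

# Copy to GPU memory, compute there, then copy the scalar result back
x_gpu = cp.asarray(x_cpu)
result = float(cp.asnumpy(cp.sqrt(x_gpu).sum()))
print(result)
```

Because the API mirrors NumPy, the main costs to watch are the host-to-device and device-to-host transfers around the computation.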
2b. Distributed computing: Going distributed is much harder for the aforementioned reasons, but if you have to, you can look at projects such as Dask, Ray, and Apache Spark.
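As one sketch of what that looks like, here is a minimal Dask example; the shapes and chunk sizes are arbitrary, and the same code can run on a laptop or, with a distributed scheduler attached, across a cluster:

```python
import dask.array as da

# A large array split into 1000x1000 chunks; each chunk becomes an
# independent task the scheduler can place on any core or worker
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Building the expression is lazy; compute() executes the task graph
print(x.mean().compute())
```

Note that the challenges listed above (network transfer, task and cluster management) don't disappear; frameworks like Dask just manage them for you.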
Whichever direction you take on your scaling journey, be sure to pause before you commit, and solve one problem at a time.
P.S. I'll cover I/O bottlenecks in a separate post.