Kedro: The Best Python Framework for Data Science!!!

Kedro: A Python Framework for Data Science

Note: This post was first published on Medium.

This post is a short summary, with some examples, of Kedro, an open-source Python framework created by QuantumBlack. It is widely used to write reproducible, maintainable, and modular Python code for building “batch” pipelines made up of several “steps”.

Kedro has been steadily gaining ground in the community, especially where a sequential execution “conveyor belt” with distinct steps is needed. This has happened mainly in data science projects, thanks to how easily it fits alongside plain Python code, its rich documentation, and its simple, intuitive design.

Installation:

Using pip:

pip install kedro

Using Anaconda:

conda install -c conda-forge kedro

Elements of Kedro:

  • Node:

“It is a wrapper for a Python function that names the inputs and outputs of that function.” In other words, a node is a block of code that drives the execution of a given piece of logic and, through its named inputs and outputs, can be chained with other blocks.

# importing the library
from kedro.pipeline import node

# Preparing the first "node"
def return_greeting():
    return "Hello"

# Defining the node that wraps return_greeting
return_greeting_node = node(func=return_greeting, inputs=None, outputs="my_salutation")
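
To make the idea of "naming the inputs and outputs" more concrete, here is a minimal plain-Python sketch. It does not use Kedro at all: the `ToyNode` class and its `run` method are hypothetical names made up for this illustration, and Kedro's real node does considerably more.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a function plus a named input and output."""

    func: Callable[..., Any]
    inputs: Optional[str]  # name of the dataset to read, or None
    outputs: str           # name of the dataset to write

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Look up the named input (if any), call the wrapped function,
        # and return the result under the declared output name.
        args = [] if self.inputs is None else [data[self.inputs]]
        return {self.outputs: self.func(*args)}


def return_greeting():
    return "Hello"


toy_node = ToyNode(func=return_greeting, inputs=None, outputs="my_salutation")
result = toy_node.run({})
print(result)  # {'my_salutation': 'Hello'}
```

The function itself knows nothing about dataset names; only the wrapper does, which is what lets the same function be reused in different pipelines.
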
  • Pipeline:

“A pipeline organizes the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies, and does not necessarily run the nodes in the order in which they are passed in.”

# Importing the library
from kedro.pipeline import Pipeline

# Assigning "nodes" to a "pipeline"
# (join_statements_node is defined in the full hello_kedro.py example further down)
pipeline = Pipeline([return_greeting_node, join_statements_node])
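
The point that nodes are ordered by their dependencies, not by the order in which they are passed in, can be sketched without Kedro. The snippet below is only an illustration under my own assumptions: it describes each node as a plain tuple and uses the standard library's `graphlib` (Python 3.9+) for the topological sort. It is not how Kedro implements this internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each hypothetical node is described as: name -> (input names, output name).
# Note that the nodes are deliberately listed in the "wrong" order.
nodes = {
    "join_statements_node": (["my_salutation"], "my_message"),
    "return_greeting_node": ([], "my_salutation"),
}

# Map each dataset name to the node that produces it.
producer = {output: name for name, (_, output) in nodes.items()}

# A node depends on whichever nodes produce its inputs.
graph = {
    name: {producer[i] for i in inputs if i in producer}
    for name, (inputs, _) in nodes.items()
}

execution_order = list(TopologicalSorter(graph).static_order())
print(execution_order)
# ['return_greeting_node', 'join_statements_node']
```

Because join_statements_node needs my_salutation, it can only run after the node that produces it, whatever the listing order was.
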
  • DataCatalog:

“A DataCatalog is a Kedro concept. It is the registry of all data sources that the project can use. It maps the names of node inputs and outputs as keys in a DataSet, which is a Kedro class that can be specialized for different types of data storage. Kedro uses a MemoryDataSet for data that is simply stored in memory.”

# Importing the library
from kedro.io import DataCatalog, MemoryDataSet

# Preparing the "data catalog"
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})
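
The idea of a catalog that maps dataset names to storage-specific classes can also be sketched in plain Python. Everything below (`ToyMemoryDataset`, `ToyJSONDataset`, `ToyCatalog`) is a hypothetical stand-in written for this post, not Kedro's own classes; it only assumes that every dataset exposes `save` and `load`.

```python
import json
import tempfile
from pathlib import Path


class ToyMemoryDataset:
    """Keeps the data in a Python attribute (hypothetical, not Kedro's class)."""

    def __init__(self):
        self._data = None

    def save(self, data):
        self._data = data

    def load(self):
        return self._data


class ToyJSONDataset:
    """Same interface, but specialized to store the data in a JSON file."""

    def __init__(self, filepath):
        self._path = Path(filepath)

    def save(self, data):
        self._path.write_text(json.dumps(data))

    def load(self):
        return json.loads(self._path.read_text())


class ToyCatalog:
    """Registry that maps dataset names to dataset objects."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def save(self, name, data):
        self._datasets[name].save(data)

    def load(self, name):
        return self._datasets[name].load()


with tempfile.TemporaryDirectory() as tmp:
    toy_catalog = ToyCatalog({
        "my_salutation": ToyMemoryDataset(),
        "my_message": ToyJSONDataset(Path(tmp) / "message.json"),
    })
    toy_catalog.save("my_salutation", "Hello")
    toy_catalog.save("my_message", "Hello Kedro!")
    loaded = (toy_catalog.load("my_salutation"), toy_catalog.load("my_message"))

print(loaded)  # ('Hello', 'Hello Kedro!')
```

Code that reads or writes data only ever uses the dataset name, so moving a dataset from memory to a file means changing the catalog entry and nothing else.
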
  • Runner:

The Runner is an object that runs the pipeline. Kedro resolves the order in which the nodes are executed:

  1. Kedro first executes return_greeting_node. This runs return_greeting, which takes no input but produces the string “Hello”.

  2. The output string is stored in the MemoryDataSet called my_salutation. Kedro then executes the second node, join_statements_node.

  3. This loads the my_salutation dataset and injects it into the join_statements function.

  4. The function joins the incoming greeting with “Kedro!” to form the output string “Hello Kedro!”

  5. The pipeline output is returned in a dictionary with the key my_message.

Example: “hello_kedro.py”

"""Contents of the hello_kedro.py file"""
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import node, Pipeline
from kedro.runner import SequentialRunner

# Prepare the "data catalog"
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})

# Prepare the first "node"
def return_greeting():
    return "Hello"


return_greeting_node = node(return_greeting, inputs=None, outputs="my_salutation")

# Prepare the second "node"
def join_statements(greeting):
    return f"{greeting} Kedro!"


join_statements_node = node(
    join_statements, inputs="my_salutation", outputs="my_message"
)

# Assign "nodes" to a "pipeline"
pipeline = Pipeline([return_greeting_node, join_statements_node])

# Create a "runner" to run the "pipeline"
runner = SequentialRunner()

# Execute a pipeline
print(runner.run(pipeline, data_catalog))

To execute the code above, just use the command below in the terminal:

python hello_kedro.py

The following message will appear on the console:

{'my_message': 'Hello Kedro!'}

Another way to visualize the execution of pipelines in Kedro is to use the kedro-viz plugin.

Happy Integrations:

In addition to being very easy to use with Python, Kedro integrates with other tools, solutions, and environments, including:

  • Kedro + Airflow:

Using Astronomer, it is possible to deploy and manage a Kedro pipeline on Apache Airflow as if it were a DAG.

  • Kedro + Prefect:

It is also possible to deploy and manage a Kedro pipeline in the Prefect Core environment.

  • Kedro + Google Cloud DataProc:

Kedro can be used within Google Cloud Dataproc.

  • Kedro + AWS Batch:

To deploy in the AWS environment, we can use AWS Batch.

  • Kedro + Amazon SageMaker:

It is possible to integrate Kedro with Amazon SageMaker.

  • Kedro + PySpark:

With Kedro we can simplify deploying pipelines that use Apache Spark via PySpark: Spark configuration such as memory usage can be centralized, Spark sessions and contexts can be managed, and the place where DataFrames are loaded and saved can be defined.

  • Kedro + Databricks:

A Kedro pipeline can easily be deployed and run within a Databricks cluster.

  • Kedro + Argo Workflows:

Argo Workflows makes it possible to automate deployment in container environments such as Red Hat OpenShift, Kubernetes, and many others.

  • Kedro + Kubeflow:

To deploy in container environments, we also have the option of using the Kubeflow integration.

This post was a brief summary of Kedro, its components, and some interesting integrations. What did you think? Have you used it?

Follow me on Medium :)
