21
Building ETL Pipelines in Python: Part 1
Data engineering refers to the development of software that performs three tasks:
- Extract raw data from various sources.
- Transform and clean up the data.
- Load the data into a data warehouse, i.e., a database management system like, Google BigQuery or Amazon Redshift. Alternatively,load the data into MySQL, a relational database.
The process is complex involving numerous software tools and advanced technical expertise. For one thing is can be challenging and overwhelming to identify the starting point and resources from which to acquire the basic knowledge. No wonder the demand for data engineers in the technology sector has risen by an order of magnitude in the past years. As the need for data generation grows, the thirst for data engineers will continue to intensify.
This post, part 1 of a three-part tutorial shows you how to build a simple ETL[extract,transform,and load] pipeline with CSV files in Python, describes the extraction steps.
Software Requirements
You will need the following software:
glob
, a package that is preinstalled that searches for files in relative paths.
For details, see this section for documentation.pandas
, a package used for data manipulation. To install, type on the command line,pip install pandas
.
For details, see the pandas site.virtualenvwrapper
, a package that sets up virtual environments for Python. For details, see the related documentation.This Kaggle dataset for the CSV data.
Here's an example of CSV data on car sales:
The Procedure:
Create a project called
etl_car_sales
with PyCharm.Create a virtual environment with the command line
mkvirtualenv etl_car_sales
.Install
pandas
andvirtualenwrapper
.Extract the zip file and move the CSV files for
car_sales
to youretl_car_sales
directory, like this:
- Create an
extract_from_csv
function to read the CSV file in your main directory and then return a Dataframe with thepandas
functionread_csv
.
- Create an
extract
function to read the data from the extracted csv file and append the data to the Dataframe that is being return theextract_from_csv
function.
Tip: To rename the columns, do the following:
- Print out the
extracted_data
variable and the result would read like this:
Part 2 will describe how to transform data in the ETL pipeline. Stay Tuned.
21