Cloud computing quickstart for data engineering

What

Cloud computing is the use of a network of remote servers hosted on the internet to store, manage and process data.

The main benefits are:

No need to invest in hardware upfront

Rapid provisioning of resources

Efficient global access through deployments in different regions

The major cloud providers include Amazon, Microsoft, Google, Alibaba, Oracle and IBM. As Amazon is the largest, this quickstart gives an overview of AWS and the basics needed for data engineering.

AWS - Amazon Web Services

AWS offers more than 140 services for computation, storage, databases, networking and development tools.
The services can be accessed in 3 ways:

AWS Management Console: https://console.aws.amazon.com/ - The web application

AWS CLI: https://aws.amazon.com/cli/ - The command line interface

SDKs: https://aws.amazon.com/tools/ - Software development kits, available in many programming languages. Driving AWS from code enables IaC (Infrastructure as Code), whose advantages are shareability, reproducibility, multiple deployments and maintainability. For development with Python we can use boto3, the AWS SDK for Python.
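As a minimal sketch of what SDK access can look like, the function below makes sure an S3 bucket exists. The bucket name and region in the usage comment are hypothetical placeholders, and running it against AWS assumes boto3 is installed and credentials are configured. The client is passed in as an argument so the logic can be tried without a real account.

```python
def ensure_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create the bucket if this account does not own it yet; return True if created."""
    # list_buckets only returns buckets owned by the calling account.
    existing = {b["Name"] for b in s3_client.list_buckets().get("Buckets", [])}
    if bucket_name in existing:
        return False
    if region == "us-east-1":
        # us-east-1 is the default location and must not be sent as a constraint.
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return True


# Usage against a real account (assumes boto3 and configured credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="eu-central-1")
#   ensure_bucket(s3, "my-example-data-lake-bucket", region="eu-central-1")
```

Because the call is safe to repeat, the same script can be shared and rerun for several deployments, which is the reproducibility benefit mentioned above.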

As there are over a hundred services available, you might be overwhelmed at first sight. To make the start a bit easier, we have put together a glossary of the services you will need for data engineering, with the corresponding links to their documentation. As there are many more services than the ones mentioned below, feel free to dive deeper into the AWS documentation.

IAM - Identity and Access Management

User

A user is an entity, person or application that interacts with AWS.

Role

A role is a set of permissions that can be assumed by any user or service that needs it. Unlike a user, it is not uniquely tied to a single entity.
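To illustrate that a role is not bound to one entity, the sketch below builds the trust policy that states who may assume a role, and passes it to the IAM `create_role` call. The helper names and the Redshift service principal are illustrative assumptions, and `iam_client` is expected to be a boto3 IAM client.

```python
import json


def build_trust_policy(service_principal):
    """Return a trust policy that lets the given AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_service_role(iam_client, role_name, service_principal):
    """Create a role that any trusted principal can assume (hypothetical helper)."""
    return iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy(service_principal)),
    )


# The policy document can be inspected without an AWS account.
print(json.dumps(build_trust_policy("redshift.amazonaws.com"), indent=2))
```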

VPC - Virtual Private Cloud

A VPC lets you launch AWS resources in a virtual network that you define according to your needs. It resembles a traditional data center network, with the benefits of cloud infrastructure.
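Defining the network largely comes down to choosing an IP range (a CIDR block) for the VPC and dividing it into subnets. The sketch below plans such a split with Python's standard `ipaddress` module. The 10.0.0.0/16 range and the /24 subnet size are arbitrary example values; with boto3 the resulting strings would typically be passed as `CidrBlock` to the EC2 client's `create_vpc` and `create_subnet` calls.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, new_prefix, count):
    """Return the first `count` subnets of size /new_prefix inside the VPC range."""
    network = ipaddress.ip_network(vpc_cidr)
    return [str(subnet) for subnet in islice(network.subnets(new_prefix=new_prefix), count)]


print(plan_subnets("10.0.0.0/16", new_prefix=24, count=3))
# prints ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
```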

S3 - Simple Storage Service

S3 stores any number of objects in buckets and lets you retrieve them at any time. Depending on your access and cost requirements, there are many different storage classes to choose from.

S3 Buckets

A bucket is a container for objects. There are a lot of useful properties like:

Versioning: keep multiple versions of an object in the same bucket

Static website hosting: a very cost-effective way to serve static web content

Requester pays: makes the requester pay for requests and data transfer costs

Permission management

Data management: create lifecycle rules to transition, archive or delete data

Metrics for usage, request, data transfer, bucket size, number of objects

Access points: Create access points to share the bucket at scale
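Two of these properties, versioning and lifecycle rules, can be sketched with boto3 as follows. The `raw/` prefix, the day counts and the helper names are hypothetical, and `s3_client` is assumed to be a boto3 S3 client for an existing bucket.

```python
def build_lifecycle_rule(prefix, archive_after_days, expire_after_days):
    """Build one lifecycle rule: move objects under `prefix` to Glacier, then expire them."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "ID": f"archive-then-expire-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }


def enable_versioning(s3_client, bucket_name):
    """Keep multiple versions of each object in the bucket."""
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )


def apply_lifecycle(s3_client, bucket_name, rules):
    """Set the bucket's lifecycle configuration to the given rules."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={"Rules": rules},
    )


rule = build_lifecycle_rule("raw/", archive_after_days=30, expire_after_days=365)
print(rule["Transitions"], rule["Expiration"])
```

Keeping the rule builder separate from the API calls means the configuration can be reviewed and tested before anything is sent to AWS.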

S3 Objects

An object is a file together with any metadata that describes that file.

EC2 - Elastic Compute Cloud

EC2 is a web service that provides secure, resizable compute capacity in the cloud. If we prefer to manage our own stack, we can run PostgreSQL on EC2 instead of using Amazon RDS or Amazon DynamoDB, and a Unix file system on EC2 instead of Amazon S3.

RDS - Relational Database Service

RDS is a managed relational database service that takes care of common database administration tasks, offers resizable capacity, and is cost-efficient.

Redshift

Column-oriented storage

MPP (massively parallel processing) database

Well suited to OLAP workloads, such as aggregating over a long history

Internally based on a modified PostgreSQL
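To see why column-oriented storage suits OLAP aggregations, the toy example below keeps the same data in a row layout and a column layout and sums one field. It is a plain Python illustration with made-up order data, not a description of how Redshift is implemented.

```python
# Row-oriented layout: the fields of each record are stored together.
rows = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "FR", "amount": 35.5},
    {"order_id": 3, "country": "DE", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# In the row layout the aggregation has to walk every full record...
total_from_rows = sum(record["amount"] for record in rows)
# ...while the column layout only needs the "amount" values.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
# prints 68.0 68.0
```

On a table with many columns and a long history, reading only the columns an aggregation needs is what keeps this kind of query fast.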

IaC - Infrastructure as Code Example with boto3