Cloud computing quickstart for data engineering
Cloud computing is the use of a network of remote servers hosted on the internet to store, manage and process data. Its main advantages are:
- no need to invest in hardware upfront
- rapid provisioning of resources
- efficient global access through deployments in different regions
Major cloud providers include Amazon, Microsoft, Google, Alibaba, Oracle and IBM. As Amazon Web Services (AWS) is the biggest one, this overview covers the AWS basics needed for data engineering.
AWS offers more than 140 services for computation, storage, databases, networking and development tools.
The services can be accessed in three ways:
- AWS Management Console: https://console.aws.amazon.com/ - the web application
- AWS CLI: https://aws.amazon.com/cli/ - the command line interface
- SDKs: https://aws.amazon.com/tools/ - software development kits, available in many programming languages. The advantages of using IaC (infrastructure as code) are shareability, reproducibility, multiple deployments and maintainability. For development with Python we can use the well-known boto3 library.
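As a minimal sketch of the SDK route, the snippet below lists the names of the S3 buckets in an account with boto3. It assumes boto3 is installed and AWS credentials are configured locally (for example via `aws configure`); the helper names `make_s3_client` and `list_bucket_names` are made up for this illustration.

```python
def make_s3_client():
    """Create an S3 client.

    Assumption: boto3 is installed and credentials are configured.
    The import is done lazily so the helper below can be tried
    with any object that offers a list_buckets() method.
    """
    import boto3

    return boto3.client("s3")


def list_bucket_names(s3_client):
    """Return the names of all buckets visible to the given client."""
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response.get("Buckets", [])]
```

With credentials in place, `list_bucket_names(make_s3_client())` should return the bucket names of the account. Passing the client in as an argument keeps the helper easy to test without touching a real account.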
With well over a hundred services available, AWS can be overwhelming at first sight. To make the start a bit easier, we have put together a glossary of the services you will need for data engineering, along with links to their documentation. There are many more services than the ones mentioned below, so feel free to dive deeper into the AWS documentation.
IAM - Identity and Access Management
An IAM user is an entity (a person or an application) that interacts with AWS.
An IAM role can be assumed by anyone who needs it. It is not uniquely tied to one entity.
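To make the idea of a role a bit more concrete, the sketch below builds a trust policy, the JSON document that states who is allowed to assume a role. Here we assume the role should be assumable by EC2 instances; the function name `build_trust_policy` is hypothetical. A document like this can be passed as `AssumeRolePolicyDocument` to boto3's `create_role` call on an IAM client.

```python
import json


def build_trust_policy(service_principal):
    """Build a trust policy that lets one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Assumption for this example: EC2 instances should be able to use the role.
trust_policy_json = json.dumps(build_trust_policy("ec2.amazonaws.com"), indent=2)
print(trust_policy_json)
```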
VPC - Virtual Private Cloud
Lets you launch AWS resources in a virtual network that you define according to your needs. It resembles a network in your own data center, with the benefits of cloud infrastructure.
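A VPC is defined by an IP address range in CIDR notation, which is then divided into subnets. The sketch below uses only Python's standard `ipaddress` module to plan such a split; the range `10.0.0.0/16` and the `/24` subnet size are assumptions chosen for illustration, not AWS defaults.

```python
import ipaddress

# Assumed address range for the VPC.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 subnets.
subnets = list(vpc_cidr.subnets(new_prefix=24))

print(f"VPC {vpc_cidr} can hold {len(subnets)} subnets of size /24")
print(f"first subnet: {subnets[0]}, second subnet: {subnets[1]}")
```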
S3 - Simple Storage Service
S3 stores any amount of data as objects in buckets and lets you retrieve and access them at any time. Depending on your needs, there are several storage classes to choose from.
S3 Buckets
A bucket is a container for objects. Buckets offer many useful properties, such as:
- Versioning: keep multiple versions of an object in the same bucket
- Static website hosting: a very cost-effective way to serve static web content
- Requester pays: makes the requester pay for requests and data transfer costs
- Permission management
- Data management: create lifecycle rules to transition, archive or delete data
- Metrics: usage, requests, data transfer, bucket size, number of objects
- Access points: create access points to share the bucket at scale
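Two of the bucket properties above, versioning and lifecycle rules, can be sketched with boto3 as follows. The function names, the `logs/` prefix and the 30 and 365 day thresholds are assumptions made for this example; `s3_client` is expected to be a client created with `boto3.client("s3")`.

```python
def build_lifecycle_configuration(prefix, archive_after_days, delete_after_days):
    """Build a lifecycle rule: archive objects first, delete them later."""
    return {
        "Rules": [
            {
                "ID": f"archive-then-delete-{prefix.strip('/')}",
                "Status": "Enabled",
                # Only objects whose key starts with this prefix are affected.
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


def configure_bucket(s3_client, bucket_name):
    """Enable versioning and attach the lifecycle rule to a bucket."""
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        # Assumed values: archive logs after 30 days, delete after 365.
        LifecycleConfiguration=build_lifecycle_configuration("logs/", 30, 365),
    )
```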
S3 Objects
An object is a file together with any metadata that describes that file.
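The pairing of file content and metadata can be sketched with two boto3 calls: `put_object` stores the data together with user-defined metadata, and `head_object` reads the metadata back without downloading the data. The helper names are hypothetical, and `s3_client` is assumed to come from `boto3.client("s3")`.

```python
def upload_with_metadata(s3_client, bucket_name, key, body, metadata):
    """Store bytes under a key, together with user-defined metadata."""
    s3_client.put_object(
        Bucket=bucket_name,
        Key=key,
        Body=body,
        Metadata=metadata,
    )


def read_metadata(s3_client, bucket_name, key):
    """Return the user-defined metadata of an object without fetching its body."""
    response = s3_client.head_object(Bucket=bucket_name, Key=key)
    return response["Metadata"]
```

A call could look like `upload_with_metadata(client, "my-example-bucket", "raw/orders.csv", b"id,amount\n1,20\n", {"source": "shop"})`, where the bucket name and key are placeholders.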
EC2 - Elastic Compute Cloud
A web service that provides secure, resizable compute capacity in the cloud. If we prefer to manage things ourselves, we can run PostgreSQL on EC2 instead of using a managed database such as Amazon RDS or Amazon DynamoDB, and use a Unix file system on EC2 instead of Amazon S3.
RDS - Relational Database Service
A managed relational database service that automates common database administration tasks, is resizable, and is cost-efficient.
Amazon Redshift
- column-oriented storage
- MPP (massively parallel processing) database
- well suited to OLAP workloads, such as summing over a long history
- internally a modified PostgreSQL
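Why column-oriented storage suits aggregations over a long history can be shown with a toy example. The sketch below is plain in-memory Python with made-up order data, not a description of how the database stores data internally: in the row layout a sum has to touch every full record, while in the column layout it only reads the single column it needs.

```python
# Made-up sample data in a row-oriented layout: one record per order.
rows = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "US", "amount": 35.5},
    {"order_id": 3, "country": "DE", "amount": 12.5},
]


def to_columns(records):
    """Convert a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented: every record is visited, including fields we do not need.
total_row_oriented = sum(row["amount"] for row in rows)

# Column-oriented: only the "amount" column is read.
total_column_oriented = sum(columns["amount"])

print(total_row_oriented, total_column_oriented)
```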