Cloud computing quickstart for data engineering
Cloud computing is the use of a network of remote servers hosted on the internet to store, manage and process data.
Major cloud providers include Amazon (AWS), Microsoft (Azure), Google (Google Cloud), Alibaba, Oracle, and IBM. As AWS is the largest provider, we will use it for an overview of the basics needed for data engineering.
AWS offers more than 140 services for compute, storage, databases, networking, and developer tools.
The services can be accessed in three ways:
- via the AWS Management Console in the browser,
- via the AWS Command Line Interface (CLI),
- programmatically via SDKs, such as boto3 for Python.
As there are over a hundred services available, you might feel overwhelmed at first. To make getting started easier, the glossary below covers the services you will need for data engineering, with the corresponding links to their documentation. Since there are many more services than the ones mentioned below, feel free to dive deeper into the AWS documentation.
IAM - Identity and Access Management
A user is an entity (a person or an application) that interacts with AWS.
A role can be assumed by any user or service that needs it; unlike a user, it is not uniquely tied to a single entity.
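The difference between a user and a role shows up in how IAM permissions are written. A minimal sketch of the two policy documents involved, as plain JSON (the bucket name and the EC2 principal below are hypothetical examples, not required values):

```python
import json

# An identity-based policy: describes WHAT is allowed. It could be attached
# to a user or a role. The bucket "my-data-lake" is a hypothetical example.
read_s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-lake",
                "arn:aws:s3:::my-data-lake/*",
            ],
        }
    ],
}

# A trust policy is what makes a role different from a user: it declares
# WHO may assume the role (here: the EC2 service), not what the role may do.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

Because a role is assumed rather than owned, the same role can serve many users or services without sharing long-lived credentials.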
VPC - Virtual Private Cloud
Lets you launch AWS resources in a virtual network that you define. Conceptually, it is your own data center with the benefits of cloud infrastructure.
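"A virtual network defined by your needs" mostly means choosing address ranges. A small sketch with Python's standard `ipaddress` module, assuming the common choice of a 10.0.0.0/16 range for the VPC, carved into /24 subnets:

```python
import ipaddress

# Hypothetical VPC address range; 10.0.0.0/16 is a common choice.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 subnets (256 addresses each), e.g. one
# public and one private subnet per availability zone.
subnets = list(vpc_cidr.subnets(new_prefix=24))

print(subnets[0])    # 10.0.0.0/24
print(len(subnets))  # 256
```

In a real VPC you would create only as many subnets as you need and attach routing and security rules to them; the arithmetic of carving the range, however, is exactly this.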
S3 - Simple Storage Service
S3 can store, retrieve, and access any number of objects at any time, organized in buckets. Depending on your needs, there are several different storage classes.
S3 Buckets
A bucket is a container for objects. Buckets have many useful properties, such as versioning, lifecycle rules, access policies, and server-side encryption.
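Bucket names are globally unique and follow strict naming rules. A simplified check covering the main rules (3-63 characters; lowercase letters, digits, dots, and hyphens; starts and ends with a letter or digit; not formatted like an IP address):

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Simplified validation of the main S3 bucket naming rules."""
    if not 3 <= len(name) <= 63:
        return False
    # lowercase letters, digits, dots, hyphens; letter/digit at both ends
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        return False
    # bucket names must not look like an IP address
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True

print(is_valid_bucket_name("my-data-lake"))  # True
print(is_valid_bucket_name("My_Bucket"))     # False: uppercase and underscore
print(is_valid_bucket_name("192.168.0.1"))   # False: IP-formatted
```

This is a sketch, not the complete rule set; the full naming rules in the S3 documentation are authoritative.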
S3 Objects
An object is a file together with any metadata that describes that file.
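An object is addressed by its bucket and key, and can carry user-defined metadata alongside the file content. A minimal sketch (bucket name, key, region, and metadata values below are hypothetical examples):

```python
bucket = "my-data-lake"
key = "raw/2024/01/events.json"  # keys may contain "/" to mimic folders
metadata = {
    "content-type": "application/json",  # standard metadata
    "source": "clickstream",             # user-defined metadata
}

# Virtual-hosted-style URL of the object, the common addressing scheme:
url = f"https://{bucket}.s3.eu-central-1.amazonaws.com/{key}"
print(url)
```

Note that S3 has no real directory tree: the "folders" in the key are just a naming convention that the console renders hierarchically.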
EC2 - Elastic Compute Cloud
A web service that provides secure, resizable compute capacity in the cloud. If we want to manage the infrastructure ourselves, we can use EC2 + PostgreSQL or EC2 + a Unix file system instead of managed services such as Amazon RDS, Amazon DynamoDB, and Amazon S3.
RDS - Relational Database Service
A relational database service that takes care of common database administration tasks, resizes automatically, and is cost-efficient.