Methods of Data Storage in Data Science You Should Know

Photo by Art Wall - Kittenprint on Unsplash

Introduction

We live in extraordinary times where technology is part of everything. One of the things that have made technology so powerful and valuable is data. A vast amount of data is generated every day on the internet. Consider the data generated on Instagram alone:

500 million Instagram stories are uploaded everyday

3.5 billion likes are recorded daily

As of 2016, 95 million posts are made every day, and Instagram usage doubled between 2016 and 2018.

This massive amount of data (called big data because of the size) cannot be stored using traditional techniques or on one machine. Storing and retrieving big data requires multiple interconnected machines. This article focuses on the different techniques of storing big data and their pros, cons, and use cases.

Types of Data

Before diving into the different types of data storage, we must first look at the types of data.

Structured Data

Human

Filling your details or information on a form

Behaviour while watching videos online. For example, the site tracks the timestamp/episode you left the video at last.

Machine

Sensor data from your IoT devices or smartphones like your GPS location, battery life, and so on

Application logs, which help engineers figure out what errors exist in their platform or how efficient their system works.

Structured data accounts for about 20% of all the data on the internet and is usually easy to analyze to derive meaningful insights.

Unstructured Data

Human

Posting that morning sunshine image on Instagram

Updating your WhatsApp status with a comic video

Live streaming on social media platforms

The last PDF you exported

Machine

Images of the earth taken from a satellite

Analyzing unstructured data is more complex and usually involves creating machine learning models. This data comprises about 80% of the data on the internet and has become easier to store and maintain.

Image Source

Semi-structured Data

I like to think of this type of data as a kind of hybrid of both the above-mentioned types. It usually contains a structured part and an unstructured part.

Some examples of semi-structured data are emails and word documents. Notice these contain text (structured data) and media files (unstructured data). Think now of the data stored in NoSQL DBs in which one item has a certain number of fields, but the next does not—that is a semi-structured dataset. Notice that an email has the to, from, cc, and bcc fields, which are structured data, but it may also contain images.

How to Store Data

Now that we have familiarised ourselves with the different types of data, we can look at the various methods of storing this data.

File Storage

Files can simultaneously be read and written.

Only users on the same network can access them.

It is not easy to replicate your data—you might lose data in case of machine failure.

Block Storage

In block storage, data is stored as a contiguous chunk (called a block). It allows us to spread our network easily across different networks, and you don’t need to know where it is stored on the disk to access it. Relational databases are examples of RDMS, making it the best type of storage for structured data.

Other characteristics include:
They are highly scalable: You can increase the size of your block storage by adding more nodes to your network, thus making it easy to scale.
Easy to replicate: Most block storage services are easy to backup/replicate. Thus, in case of machine failure, your data is still intact.
Reads and writes are fast, and you do not need to know where the data is on disk.
Block storage is generally expensive as the data increases. This makes it expensive for data that is large, for instance.

Object Storage

An object is made up of 3 main parts:

The data: This is the image, picture, etc., that we want to store.

Metadata: This can be a description of what this data stands for. Most object storage services allow us to search the content of this metadata.

Unique Identifier: This can be used to retrieve the object easily at any point.

Anytime an object is created, most services replicate it three times. This makes it easy to retrieve the object and search for the data. Thus, object storage is ideal for storing unstructured data as it can easily be accessed using an ID.

Advantages of object storage include:

Way cheaper than other types of storage

Data replication comes in by default

Rich metadata feature

The storage is not coupled to a machine and thus is easy to scale

Grouping of data can be done using a metadata value

Conclusion

In this article, we looked at the different types of data in big data and explored the main techniques of storing big data. I hope you learn something new from this article.