Tutorial - Building a database with LMDB - Part 1/Architecture

In the article LMDB - Faster NoSQL than MongoDB I showed how LMDB can be used to achieve significantly higher performance than MongoDB. A lot of this speed comes from LMDB being a memory-mapped database, along with various optimisations that allow it to make optimal use of the OS's buffer cache and the CPU's L1 cache.
But what I didn't cover was how to use it in a real-life application. So, let's do just that.

This series is going to go over the architecture side of building a custom database solution, and why you might want to actually take the plunge.

When does it make sense?

When designing a database from the ground up, as with most significant projects, it's important to understand which features are important and which aren't. This starts with understanding why you're building a database.

Here are some good reasons:

Pedagogical / Educational experience

Strict performance requirements (at the expense of features and development time)

However, if your reasons are in this next list, you might want to reconsider:

"I can do it better"

Why does this matter? LMDB is already a database, isn't it?

LMDB is a database in the same sense that SQLite3 is a database: it has ACID transactions, it keeps a copy on disk, it is crash resilient, and it serialises writes.

But it doesn't have a remote interface: no sockets, no ports, no network support of any kind. If you need any of that, you need to build it.

Out of the box, LMDB could be described as a hashmap with ACID transactions (strictly speaking an ordered key-value map, since it keeps its keys sorted in a B+tree).

This means: if you need multiple servers to connect to it, then you need to build that; if you need backups, then you need to build that; if you need sub-object indexing, then you need to build that too.
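To make "sub-object indexing" concrete, here is a minimal sketch of what building it yourself involves. It uses plain Python dicts as a stand-in for the key-value store (it does not call LMDB), and the `IndexedStore` class, its method names, and the JSON encoding are all assumptions made for illustration.

```python
import json
from typing import Dict, List


class IndexedStore:
    """Key-value store plus one hand-maintained secondary index.

    Plain dicts stand in for the database here; this is not LMDB's API.
    Assumes every stored object contains the indexed field.
    """

    def __init__(self, field: str) -> None:
        self.field = field
        self.primary: Dict[bytes, bytes] = {}       # key -> encoded object
        self.by_field: Dict[str, List[bytes]] = {}  # field value -> keys

    def put(self, key: bytes, obj: dict) -> None:
        old = self.primary.get(key)
        if old is not None:
            # Overwriting: drop the stale index entry first.
            old_value = str(json.loads(old)[self.field])
            self.by_field[old_value].remove(key)
        self.primary[key] = json.dumps(obj).encode("utf-8")
        self.by_field.setdefault(str(obj[self.field]), []).append(key)

    def find(self, value) -> List[dict]:
        """Return every stored object whose indexed field equals value."""
        keys = self.by_field.get(str(value), [])
        return [json.loads(self.primary[k]) for k in keys]


store = IndexedStore("site")
store.put(b"a", {"site": "north", "value": 1})
store.put(b"b", {"site": "south", "value": 2})
print(store.find("north"))
```

The point is that the store only understands opaque keys and values; keeping the index consistent with the data on every write is the application's job.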

So what are we going to build?

Let's assume the following: We're building a measurement platform where our edge nodes receive UDP packets from the apparatus containing: a single sample, an identifier, and a timestamp. The edge nodes need to ship aggregated measurements to a centralised API periodically.
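As a rough illustration of such a packet, the sketch below packs and unpacks the three fields with Python's `struct` module. The wire layout (a 4-byte identifier, an 8-byte nanosecond timestamp, and an 8-byte double for the sample, all in network byte order) is an assumption for this example, as are the function names.

```python
import struct
from typing import Tuple

# Hypothetical wire layout: network byte order,
# uint32 identifier, uint64 timestamp (ns), float64 sample.
PACKET = struct.Struct("!IQd")


def encode_packet(identifier: int, timestamp_ns: int, sample: float) -> bytes:
    """Build the datagram payload the apparatus would send."""
    return PACKET.pack(identifier, timestamp_ns, sample)


def decode_packet(data: bytes) -> Tuple[int, int, float]:
    """Parse a datagram payload into (identifier, timestamp_ns, sample)."""
    if len(data) != PACKET.size:
        raise ValueError(f"expected {PACKET.size} bytes, got {len(data)}")
    return PACKET.unpack(data)


payload = encode_packet(7, 1_700_000_000_000_000_000, 21.5)
print(len(payload), decode_packet(payload))
```

A fixed-size layout like this keeps parsing cheap on the edge node, which matters given how quickly each packet has to be acknowledged.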

Let's start by considering the edge nodes:

When they receive a UDP packet they need to reflect the timestamp back to the apparatus as soon as it's recorded (kind of like an ACK, but lazier). We want to do this as quickly as possible because the apparatus is busy-waiting and will re-transmit if it hasn't received this ACK within a very short period.

So then let's define the requirements for our edge nodes:

The system must ensure the safe storage of all samples

The system must acknowledge each sample as quickly as possible

The system must deduplicate any re-transmissions

The system must periodically send aggregated copies of the sample data upstream

We can model these requirements as two processes connected by a database:

The left-hand process acts as a server: it receives the packet, inserts it using the identifier+timestamp as the key, appends this key to the index of samples in this time window, and finally sends the ACK.
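The ingest steps above can be sketched as follows. This is a minimal illustration in which dicts stand in for the database (no LMDB calls are made); the `EdgeStore` class, the key and ACK layouts, the one-minute window, and the choice to bucket by the sample's own timestamp are all assumptions.

```python
import struct
from typing import Dict, List

KEY = struct.Struct("!IQ")      # hypothetical key: identifier + timestamp
ACK = struct.Struct("!Q")       # hypothetical ACK: the reflected timestamp
WINDOW_NS = 60 * 1_000_000_000  # hypothetical one-minute time window


class EdgeStore:
    """In-memory stand-in for the edge node's database."""

    def __init__(self) -> None:
        self.samples: Dict[bytes, float] = {}       # key -> sample
        self.windows: Dict[int, List[bytes]] = {}   # window -> keys

    def ingest(self, identifier: int, timestamp_ns: int, sample: float) -> bytes:
        """Store one sample, index it, and return the ACK payload."""
        key = KEY.pack(identifier, timestamp_ns)
        if key not in self.samples:
            # First time we see this sample: store and index it.
            self.samples[key] = sample
            window = timestamp_ns // WINDOW_NS
            self.windows.setdefault(window, []).append(key)
        # A re-transmission is not stored again, but it is still
        # acknowledged, since the earlier ACK may have been lost.
        return ACK.pack(timestamp_ns)


store = EdgeStore()
ack = store.ingest(7, 120_000_000_000, 21.5)
store.ingest(7, 120_000_000_000, 21.5)  # duplicate re-transmission
print(len(store.samples), ACK.unpack(ack))
```

Because the key is derived from the identifier and timestamp, deduplication falls out of a simple existence check. In a real store the insert and the index append would presumably share a single write transaction, so that a crash cannot leave a sample unindexed.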

The right-hand process periodically wakes up, reads the index, collates the samples, and makes an HTTP POST (containing all of the samples from this time window) to the upstream API.
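A minimal sketch of the collation step might look like this. It builds the POST request without sending it; the URL, the JSON payload shape, the key layout, and the `build_upload` name are hypothetical, and dicts again stand in for the database.

```python
import json
import struct
import urllib.request
from typing import Dict, List

KEY = struct.Struct("!IQ")  # hypothetical key: identifier + timestamp
UPSTREAM_URL = "http://upstream.example/samples"  # hypothetical endpoint


def build_upload(
    window: int,
    index: Dict[int, List[bytes]],
    samples: Dict[bytes, float],
) -> urllib.request.Request:
    """Collate one window's samples into a single POST request."""
    rows = []
    for key in index.get(window, []):
        identifier, timestamp_ns = KEY.unpack(key)
        rows.append({"id": identifier, "ts": timestamp_ns, "value": samples[key]})
    body = json.dumps({"window": window, "samples": rows}).encode("utf-8")
    return urllib.request.Request(
        UPSTREAM_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


key = KEY.pack(7, 1000)
request = build_upload(0, {0: [key]}, {key: 21.5})
print(request.get_method(), request.data)
```

Sending would be a call to `urllib.request.urlopen(request)`. One reasonable design choice is to remove a window's index only after the upstream API has confirmed receipt, so that a failed upload can be retried on the next wake-up.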

There are, of course, already products for this: Apache Pulsar, Redis, and RabbitMQ all spring to mind as potential solutions to this task. But that's not why you're here, so we'll assume that there are reasons that you don't want to build on top of any of those.

Keep an eye open for Part 2, where we’ll cover Data Structures and CTypes.
See you there!