Towards Stabilization: Serialization Format(s) for PliantDb
Last week someone interested in using PliantDb asked a question on our Discord server:
> The current version is not yet at 1.0, and messages are everywhere to not use it, yet. What features aren't yet implemented or trusted?
Because Discord isn't a great way to archive these answers publicly, I wanted to share my response:
> In terms of what's trusted: everything. I feel confident in this implementation because of the code coverage: https://pliantdb.dev/coverage -- It's not perfect, and I'm sure there are some bugs, but the biggest concern to me is storage formats. I may replace cbor with something else, for many reasons that I'll leave outside of chat here (Dax doesn't even know that thought process yet lol). This sort of fundamental storage change would make a simple update incompatible, and that's why I'm not ready for people to adopt PliantDb aggressively yet.
>
> That being said, part of the unit tests do include testing backup/restore, and my intention is to ensure that the export format will always be able to bring you from a previous version to the current version in those situations. The gotcha right now for that feature is that the key-value store isn't backed up currently: https://github.com/khonsulabs/pliantdb/issues/50 (I discovered I had overlooked that feature while hooking up those unit tests).
>
> Missing features that I'm aware of for local/embedded use: Collections don't have a List function. You can list by creating a view over the collection, but I need to add a separate List endpoint. I started this the other day, but I was hoping to do it by replacing get_multiple. I realized that approach was a bad idea from a permissions standpoint, so I reverted the changes to tackle it another day.
>
> For server/client: There isn't any multi-user support (yet). We're on the cusp of it. The certificate handling on the server portion for the QUIC protocol currently only supports pinned certificates -- the goal is for our HTTP + QUIC layers to eventually share the same certificate. For websockets, there's no TLS currently, and the websockets are mounted at root. Eventually they will be moved to a route on an HTTP layer that you will be able to extend with your own HTTP routes.
That question spurred my brain into action, though. A few weeks ago, I had begun looking into how to protect PliantDb from memory-exhaustion attacks. I knew how bincode mitigated them, but my initial searches on mitigation strategies for serde-cbor came up blank.
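For context, bincode's mitigation is a configurable byte limit on deserialization, so a malicious length prefix can't trigger an enormous allocation. A minimal sketch using the bincode 1.x Options API (the 16 KiB limit here is an arbitrary value for illustration):

```rust
use bincode::Options;

// Refuse to read or allocate more than 16 KiB while deserializing,
// even if the payload claims to contain a much larger value.
fn deserialize_limited(payload: &[u8]) -> bincode::Result<Vec<u8>> {
    bincode::DefaultOptions::new()
        .with_limit(16 * 1024)
        .deserialize(payload)
}
```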
The upshot of my searches is that there is a question about whether the current serde-cbor crate should be considered the mainline one, or whether a newer one (Ciborium) should replace it. I should note, I haven't tested either crate against this attack, and it could be that one or both of them already mitigate it somehow. And, if either is susceptible, pull requests could address the issue. But I wasn't sure where my efforts to further investigate should be spent.
At this point, I wanted to remind myself why I made the decisions I did. There are two types of data structures in use in PliantDb that need to be serialized and deserialized: ones PliantDb itself manages, and ones that users of PliantDb will provide. This is where the power of serde comes in: PliantDb only needs the user types to implement Serialize and Deserialize, and they can easily be stored in PliantDb.
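For illustration, a user type needs nothing more than the two derives (Shape here is an arbitrary example, not a type PliantDb defines):

```rust
use serde::{Deserialize, Serialize};

// Deriving these two traits is all PliantDb requires of a user type.
#[derive(Serialize, Deserialize)]
struct Shape {
    sides: u32,
}
```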
When considering storage formats for user types, it's important to think about one of the most important aspects of a database: storing and loading your data reliably. Generally speaking, a self-describing format is one that includes enough information that it can be loaded without having the original data structure for reference. bincode, by contrast, is not self-describing, and has a note in its README discussing the limitations of using such a format for long-term storage.
In short, if I want PliantDb to be easy to use in a reliable fashion, user data types should be encoded using a self-describing format. CouchDB, a major inspiration for PliantDb, stored documents as JSON. However, JSON isn't a particularly efficient format, and in my research, CBOR stood out: an open-standard binary format with a reasonable amount of popularity in the Rust community.
For the internal PliantDb structures, I am willing to subject myself to limitations on how to manage migrating between versions of data structures. I want to serialize those structures as quickly as possible while still keeping some flexibility. bincode fits this bill perfectly: while a custom format could technically be faster, bincode is very fast and well-tested.
So, that's the reasoning behind picking CBOR and bincode. But something rubbed me the wrong way about CBOR and most other self-describing formats. The friction of wanting to solve the only outstanding question for the storage of PliantDb's documents made me confront one of my only dislikes of the CBOR format: its verbosity.
Imagine this data structure:
```rust
use chrono::{Date, DateTime, Utc};

struct Logs {
    date: Date<Utc>,
    entries: Vec<LogEntry>,
}

struct LogEntry {
    timestamp: DateTime<Utc>,
    level: String,
    message: String,
    // ...
}
```
When encoding this data structure with 50 entries, the identifiers timestamp, level, and message will each appear in the resulting file that many times.
As someone who has worked on a compiler that targeted multiple executable formats, I know one of the tricks of the trade: executables include a string table that contains all of the static strings in your binary. If you use the string literal "hello" in 30 files, the compiler will encode the same address for each reference.
My theory was that by generating a string table for all of the identifiers in the data structures, I could easily gain storage efficiency while hopefully retaining performance similar to CBOR's.
How beneficial would it be? Only one way to find out.
I started up a project the next day and, lacking creativity, named it PliantDb Binary Object Representation, or PBOR. While I named it after CBOR, I genuinely came up with this format independently, and while it bears a resemblance, there are a few distinct features. First, let me state my goals for this project explicitly upfront:
- Implement a self-describing serialization format with "full" compatibility with serde's features. Essentially, design it to fit serde's design like a glove.
- Be safe to run in production: ensure it is safe against malicious payloads.
- Be compact: it should be more compact than CBOR.
- Be efficient: it should be roughly equivalent to the current performance of CBOR.
So first, a quick discussion about the practicalities of having a string/identifier table. One downside is that you don't know the size of the table until the entire data set has been scanned. This creates a conundrum: if you want the table at the start of the output, you either need to allocate an arbitrary amount of space and come back to patch it in, or write the table after the data and include a 'jump' target in a header (requiring the ability to seek backward over your written data).
The problem with both approaches is similar: the entire payload must be in memory to process the data efficiently, either during serialization or deserialization. So, as I began to think about how to design the format, I started thinking about a format that would allow me to output an identifier once and then refer to it by id from that point on.
This highlighted a core design idea: each chunk of data would have a kind and an optional argument. This turns out to be another way that CBOR and my format differ. In CBOR, the argument is always output as a second byte (or more, depending on how big the integer value is). The way I tackled the problem requires slightly more work but appears to save storage space over time.
Let's establish a new term: an Atom. In PBOR, an atom is an individual chunk of data. The first byte contains three pieces of information:
- Upper nibble (& 0b11110000): the atom kind.
- Fifth bit (& 0b1000): whether additional bytes are part of the argument.
- Last 3 bits (& 0b111): the first 3 bits of the argument.
To parse the atom header: if the lower nibble is not 0, there is an argument. The last three bits are extracted, and if the fifth bit is set, an additional byte is read. The lower 7 bits of each additional byte are shifted into the proper location, and if the highest bit is set, the loop continues. The maximum size for an argument is a u64, which makes the maximum atom header weigh in at 10 bytes with this encoding.
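Here is a minimal sketch of that parsing loop. The function name and error handling are hypothetical, and a hardened version would also cap the number of continuation bytes:

```rust
fn parse_atom_header(bytes: &mut impl Iterator<Item = u8>) -> Option<(u8, Option<u64>)> {
    let first = bytes.next()?;
    let kind = first >> 4; // upper nibble: the atom kind
    if first & 0b1111 == 0 {
        return Some((kind, None)); // lower nibble is 0: no argument
    }
    let mut argument = u64::from(first & 0b111); // the first 3 bits of the argument
    let mut shift = 3;
    let mut more = first & 0b1000 != 0; // fifth bit: additional bytes follow
    while more {
        let byte = bytes.next()?;
        argument |= u64::from(byte & 0x7f) << shift; // lower 7 bits, shifted into place
        shift += 7;
        more = byte & 0x80 != 0; // highest bit set: keep reading
    }
    Some((kind, Some(argument)))
}
```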
This packing provides some interesting opportunities. The remaining three bits can hold a value of 0 through 7, and, if needed, the argument can scale up to a u64; values less than 8, however, fit in a single-byte atom header.
Let's examine integers and floats. The most common sizes are all 8 bytes or less. So, if we subtract 1 from the byte count (and disallow a 0-byte integer), every integer up to a u64 requires only a single-byte header to denote an integer atom of X bytes in size.
With bytes and strings, the argument can be the length of the data. Small values still fit within a single byte, and strings or byte sequences less than 1,024 bytes long fit within a two-byte header. Long story short, the loss of a single bit of encoding still allows most practical values to fit in a one-byte-smaller encoding.
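A complementary sketch of writing a header under the same assumed layout (again, the name is hypothetical):

```rust
fn write_atom_header(out: &mut Vec<u8>, kind: u8, mut argument: u64) {
    let mut first = (kind << 4) | (argument & 0b111) as u8; // pack the low 3 bits
    argument >>= 3;
    if argument > 0 {
        first |= 0b1000; // fifth bit: more argument bytes follow
    }
    out.push(first);
    while argument > 0 {
        let mut byte = (argument & 0x7f) as u8;
        argument >>= 7;
        if argument > 0 {
            byte |= 0x80; // highest bit: another byte follows
        }
        out.push(byte);
    }
}
```

For example, a u32 is a 4-byte integer atom, so its argument is 3 (the size minus 1), which fits in the low 3 bits and produces a single-byte header.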
Finally, let's think about identifiers. In PBOR, there is an atom kind, Symbol. When the serializer first encounters a new identifier, it writes an atom (Symbol, 0), followed by a string atom containing the identifier. The deserializer expects a string when it receives a 0 in the atom header. Both the serializer and deserializer assign each new symbol an id, starting at 1 and counting upwards.
When the serializer encounters an identifier it has already serialized, it will emit the symbol ID as the atom argument. The deserializer will not expect a string when it receives a non-zero argument and instead will look up the already-deserialized string.
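The serializer's side of that bookkeeping can be sketched as a small map; the names here are hypothetical, not the actual implementation:

```rust
use std::collections::HashMap;

struct SymbolMap {
    ids: HashMap<String, u64>,
    next_id: u64,
}

impl SymbolMap {
    fn new() -> Self {
        // Ids start at 1; an argument of 0 signals a brand-new symbol.
        Self { ids: HashMap::new(), next_id: 1 }
    }

    /// Returns the existing id for `identifier`, or assigns the next id and
    /// returns None, telling the caller to emit (Symbol, 0) plus the string.
    fn intern(&mut self, identifier: &str) -> Option<u64> {
        if let Some(&id) = self.ids.get(identifier) {
            return Some(id);
        }
        self.ids.insert(identifier.to_string(), self.next_id);
        self.next_id += 1;
        None
    }
}
```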
Here's how the arbitrary benchmark I've chosen (adapted from this project's log benchmark) turned out:
| Library | Serialize (ms) | Deserialize (ms) | Length (bytes) | Gzip length (bytes) |
|---|---|---|---|---|
| bincode | 0.5757 | 2.3022 | 741,295 | 305,030 |
| pbor | 2.1235 | 4.7786 | 983,437 | 373,654 |
| serde-cbor | 1.4557 | 4.7311 | 1,407,835 | 407,372 |
| serde-json | 3.2774 | 6.0356 | 1,827,461 | 474,358 |
These numbers should be considered arbitrary by anyone reading this. PBOR is not a clear winner on any given metric, but it did achieve my primary goals.
Ultimately, going into the experiment, I underestimated the cost of building and maintaining the symbol map during both serialization and deserialization. It took far too long to optimize it enough to reach equivalent deserialization speed. I'm confident I can squeeze a little more performance out here or there, but I've stopped focusing on that for now. Instead, I wanted to openly ask: does this seem like a good idea, or should I just keep embracing CBOR?
Unfortunately, to give realistic practical numbers, I'll need to take this experiment further, so I'm taking this moment to pause, reflect, and make sure this goal is worth spending time on.
One way to prove its worth would be more benchmarks. But to benchmark the true impact on PliantDb, we must consider how data flows through the database.
At the core, the Document type contains the serialized bytes of the stored document. This means that when saving to the database, the code connecting to the database is responsible for serialization, not the server. Thus, the cost of serialization is paid wherever the documents are being saved from.
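To illustrate, here's a simplified sketch of that idea (not PliantDb's actual definition):

```rust
// The database stores each document as opaque, already-serialized bytes.
struct Document {
    id: u64,
    contents: Vec<u8>, // produced by the connecting code, never by the server
}
```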
If your View code deserializes the document on the server, the deserialization speed impacts that code's execution. However, this only affects the View updating process and does not impact View queries themselves.
The server doesn't deserialize documents for document fetching or view queries; it simply sends the serialized bytes across. Thus, the format of the data on disk directly impacts the amount of data transmitted across the network.
The last thing I would find interesting to measure in real-world workloads is how often a document is serialized compared to deserialized. It seems reasonable to assume that the average document is deserialized more than once for each time it's serialized. Yet, not all data is the same -- many kinds of data, such as application logs, are written often and rarely read.
Because of this mixture of "who pays the cost," there may ultimately not be a correct answer. My gut says that PBOR is an interesting option, but there are significant benefits to using an open standard like CBOR. I don't believe either choice will significantly affect the performance of PliantDb servers. Finishing up PBOR would require several more days to flesh out unit testing, benchmarks, and a few rough edges.
As such, I'm seeking input! I'd love to hear what your thoughts are for the self-describing format support. Here are the three options as I see them, but please leave a comment if you have other ideas.
- Stick with CBOR.
- PBOR sounds worth pursuing further.
- PliantDb shouldn't have one enabled by default, and users should be able to pick via feature flags. Clients and servers should be able to support multiple formats at the same time.
I'm running a poll on this post on our Discourse forums, but I would love feedback in whatever way is the easiest for you to provide it.
Keep in mind that this only impacts the built-in deserialization methods. You can always interact with the document contents directly to use your own libraries.
Thank you in advance for any feedback!