Azure Cosmos DB | Mongo DB - Embedding vs Reference

Schema-less Databases similar to Mongo DB help us design models, store and query data easily and rapidly. But it is very important to understand, design and create the right schema design for your application which has great impact on the performance, scalability, costs etc.

Below are the key factors we need to consider before start designing our data models in Mongo DB.

Understand the difference between Normalized/Relational database to a Schema-less Mongo/Azure Cosmos DB
Is our application read or write heavy
How to model data in a schema-less database?
In which scenarios we need to embed data and which scenarios we need to refer to data?

Embedding data model pattern

As a developer/architect, when we start working in a schema-less database, we always tend to design schema similar to relational or normalized database. We would like to design the data into multiple tables as we traditionally design in a SQL normalized database but we would miss the great advantages of Mongo DB.

So it is better to understand the difference between traditional Normalized/Relational Database and Schema-less database.

For example, in a relational schema design, developers design the schema independent of queries. will normalize the data into multiple entities to avoid storing redundant data on each record and rather refer to the data in the related entities.

In the below example, it illustrates how we model order data in a relational database.

Order Data - Schema Design in Relational Database

To query the order, item details, contact details, etc, we need to make joins to other tables and fetch the data.

In the same manner, to update a single order item details, we need to update multiple tables.

Let's see how we can design the same order data model in Mongo / Azure Cosmos DB.

{ 
“id”: “1”, 
“orderdate”: “02/08/2021”,
“tax” : “8”,
“subtotalbeforetax”: “69”,
“shipmentdate”: “03/08/2021”,
“orderitems”: [ 
 { 
  “itemname”: “item1”,
  “quantity”: “2”,
  “itemprice”: “12”,
  “totalprice”:”24" 
 },
 { “itemname”: “item2”,
   “quantity”: “3”,
   “itemprice”: “15”,
   “totalprice”: “45”
 } 
],
“shippingcontact”: [ 
 {
   “name”: “<<person1>>”,
   “street”: “<<street1>>”,
   “city”: “<<city1>>”, 
   “state”: “<<state1>>”,
   “country”: “<<country1>>”, 
   “zipcode”: “<<zipcode1>>”,
   “phone”: “<<street1>>”
 },
]
 “billingcontact”: [ 
 {
  “name”: “<<person1>>”,
  “street”: “<<street1>>”,
  “city”: “<<city1>>”, 
  “state”: “<<state1>>”,
  “country”: “<<country1>>”,
  “zipcode”: “<<zipcode1>>”,
  “phone”: “<<street1>>”
 },
] 
}

In the above json document , we have denormalized order data by embedding all the data related to the order such as line item details, shipping and billing contact details etc., into a single document.

We are also flexible to change any fields or the sub objects/arrays format entirely anytime.

Now we can retrieve the complete order details in a single query/ read operation against a single embedded document.

Same way, updating the order with the item details and shipping information also can be done in a single update/write operation against the single order document.

In general, it is always recommended to go for Embed. Except for some specific cases where we need to go for Reference. Embedding also improves query-read performance.

Reference data model pattern

As long as we have less data to embed, we are good with Embedding Schema design. For one-one and one-few relationship entities, Embedding Data model pattern is the best choice.

But if we have too much data to embed, for one-many relationship entities where the child documents can grow above the limit or where the data might change frequently, it is better to go Referencing data model pattern.

For example, a library product catalog can have "n" number of book items which keep changing on a daily basis and it can experience growth regularly.

Product_Catalog:

“Product_Catolog” : 
{ 
  “id”: “1”, 
  “libraryname”: “<<libraryname>>”, 
  “product_catalog_no”: “1234”, 
  “books”: [“BookId(‘1111’)”, “BookId(‘2222’)”, “BookId(‘3333’)”] 
}

Books:

“books” :[ 
 { 
   “_id” : “Id(‘1111’)”, 
   “title” : “Book 1111”, 
   “author” : “Author 1111”, 
   “qty”: “10”, 
   “price”:” 24.99" 
 } 
 { 
   “_id” : “Id(‘2222’)”, 
   “title” : “Book 2222”, 
   “author” : “Author 2222”, 
   “qty”: “15”, 
   “price”:” 30.99" 
 } 
 { 
   “_id” : “Id(‘3333’)”, 
   “title” : “Book 3333”, 
   “author” : “Author 3333”, 
   “qty”: “20”, 
   “price”:” 14.99" 
 } 
]

In the above example, our book documents are independent of the parent product_catalog document. Any changes to the book documents can be updated separately.

Another key examples for frequent data updates are weather, stock exchange etc., where we can expect changes consistently. For these examples, embedding data model pattern may not be a good choice.

In a nutshell, there are different ways we can model our data in Mongo DB by embedding data or by referencing documents similar to normalizing data in SQL.

By using these two data model patterns, we can make efficient, scalable and powerful queries to documents that are completely very useful and impactful for your applications.

Based on my learning and experience, I have only touched a bit about Embedding and Referencing schema designs. There are still lot to read and learn about all different one-one, one-few, one-many, many-many relationship examples to know more detail about these data model patterns and best practices.

Please check out the official Mongo DB and Azure Cosmos DB documentation for further learning.

Thanks for reading this post!

I hope this article is informative and helpful in some way. If it is, please like and share this article. Follow me on Twitter | LinkedIn for more related tips and posts.

Happy learning!