Basics about MongoDB indexes

First of all, we run a Docker container with Mongo 5, open the Mongo shell and create the users collection in the test database.

docker run --name mongo5 -p 27017:27017 -d mongo:5.0.2
docker exec -it mongo5 mongo
use test
db.createCollection("users")

Let's create some data

To play with something familiar, we made up a sports-related collection (of course).

{
    "_id" : ObjectId("6148fddb88c2dc877518437e"),
    "user_id" : 158021798,
    "phone" : 627210141,
    "favourite_sports" : [ "ping-pong", "padel" ],
    "stats" : {
        "matches_won" : 907,
        "matches_lost" : 566
    }
}

Basically we have come up with a simple collection where we store various information about a user, such as phone number, favourite sports and some statistics of the matches played by the user. Don't pay special attention to it, it's a silly example.

To make it exciting, let's create 1 million documents following this structure, but with random data. The following script will fulfill this function. Of course, it will take a while. We only put the script because... why not? But you don't need to pay much attention to it.

function random(min, max) {
  return Math.floor(Math.random() * (max - min) + min);
}

const sports = ["padel", "tenis", "ping-pong", "badminton"];
const nDocs = 1000000;

documents = [];
for (let i = 0; i < nDocs; i++) {
    const shuffled = sports.sort(() => 0.5 - Math.random());
    const random_sports = shuffled.slice(0,
        random(1, sports.length));

    documents.push({
        "user_id": random(100000000, 199999999),
        "phone": random(600000000, 699999999),
        "favourite_sports": random_sports,
        "stats": {
            "matches_won": random(0, 1000),
            "matches_lost": random(0, 1000)
        }
    });
}

db.users.insert(documents)

So now we have our collection full of data! Let's run some queries to see how Mongo performs searching through 1 million documents. Note that to have a real example of the collection, we can run db.users.find().limit(1) and use the values from that particular document.

ID index

Let's go with the first round. Search for a document by ID.

db.users.find({"_id": ObjectId("6148fddb88c2dc877518437e")})

{
    "_id" : ObjectId("6148fddb88c2dc877518437e"),
    "user_id" : 158021798,
    "phone" : 627210141,
    "favourite_sports" : [ "ping-pong", "padel" ],
    "stats" : {
        "matches_won" : 907,
        "matches_lost" : 566
    }
}

Well, it has returned a document, but this is not what we are interested in. To see what happens behind this query, we have to execute the following:

db.users.find({"_id": ObjectId("6148fddb88c2dc877518437e")})
.explain("executionStats")

This returns data on what Mongo used to get the results. In addition, if we specify the "executionStats" option, we see statistics on how was the performance's query. In the result we can see a lot of interesting information. But for this post, we are going to focus only on some statistics. Notice that this operation of explaining the query can be done in different tools such as MongoDB Compass.

{
    ...
    "executionStats" : {
            "executionSuccess" : true,
            "nReturned" : 1,
            "executionTimeMillis" : 0,
            "totalKeysExamined" : 1,
            "totalDocsExamined" : 1,
            "executionStages" : {
                "stage" : "IDHACK",
                "nReturned" : 1,
                "executionTimeMillisEstimate" : 0,
                "works" : 2,
                "advanced" : 1,
                "needTime" : 0,
                "needYield" : 0,
                "saveState" : 0,
                "restoreState" : 0,
                "isEOF" : 1,
                "keysExamined" : 1,
                "docsExamined" : 1
            }
    },
    ...
}

Here we can see that the execution time in milliseconds has been 0 because it has gone very fast. But the most interesting of all is that the stage is IDHACK, which means that the index by _id, which Mongo creates by default, has been used. As we all know, an index is nothing more than a data structure that allows you to quickly reference relevant documents. Specifically, Mongo stores indexes in a B+ tree. The reason why a B+ tree is used instead of a B-tree, or a binary tree, or even a hash table... gives for another whole post. Maybe next time.

Note also that 1 document has been examined and 1 has been returned. Therefore we can say that the ratio of examined documents/returned documents is 1, which is optimal. This means that the rest of 999,999 documents in our collection have not been examined. And this is what we want to achieve in our system, that all queries have a low ratio to avoid going through unnecessary documents, worsening the performance.

Single field index

What if we have to search by user ID? Let's run it directly with the explain.

db.users.find({"user_id": 158021798}).explain("executionStats")
{
    ...
    "executionStats" : {
            "executionSuccess" : true,
            "nReturned" : 1,
            "executionTimeMillis" : 523,
            "totalKeysExamined" : 0,
            "totalDocsExamined" : 1000000,
            "executionStages" : {
                "stage" : "COLLSCAN",
                "filter" : {
                    "user_id" : {
                        "$eq" : 158021798
                    }
                },
                "nReturned" : 1,
                "executionTimeMillisEstimate" : 11,
                "works" : 1000002,
                "advanced" : 1,
                "needTime" : 1000000,
                "needYield" : 0,
                "saveState" : 1000,
                "restoreState" : 1000,
                "isEOF" : 1,
                "direction" : "forward",
                "docsExamined" : 1000000
            }
    },
    ...
}

It doesn't look so good here. The execution time has already been half a second (523 milliseconds) and the ratio is 1,000,000. All documents have been examined, which means that Mongo has to read every document, generally from the disk. But this is too much taking into account that only one document was returned. And this is what the COLLSCAN stage means. The whole collection has been scanned. This is of course the worst case scenario. If searching by user ID is mandatory, we have to create an index by user_id field:

db.users.createIndex({"user_id": 1})

The 1 indicates the ascending order in which the index is stored. When indexes span only one field, it doesn't matter if we put it in ascending (1) or descending (-1) order, because Mongo can read them in any direction. However, when we put more than one attribute in an index, and the query requires sorting, the direction of the indexes is important, although we are not going to discuss it in this post, so we will create them all with ascending direction.

So, let's run the query again:

db.users.find({"user_id": 158021798}).explain("executionStats")
{
    ...
    "executionStats" : {
            "executionSuccess" : true,
            "nReturned" : 1,
            "executionTimeMillis" : 2,
            "totalKeysExamined" : 1,
            "totalDocsExamined" : 1,
            "executionStages" : {
                "stage" : "FETCH",
                "nReturned" : 1,
                ...
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 1,
                    ...
                    "indexName" : "user_id_1",
    ...
}

Here we can see two stages. First we scanned the indexes (IXSCAN) using the index named user_id_1, which is the default name of the index we just created. From the scanned indexes, 1 document was taken (FETCH). Therefore, thanks to the use of the index, we have only examined the document we needed, instead of the million we were examining before. With this, we managed to make the query take 2 milliseconds, as opposed to the previous 523.

Compound index

And what if we want to extract the documents where the user loves ping-pong and has won more than 900 matches?

db.users.find({
    "favourite_sports":"ping-pong",
    "stats.matches_won": {$gt: 900}
})

We can create compound indexes, and not only that, they can also be applied to lists (like favourite sports) or embedded documents (like stats):

db.users.createIndex({
    "favourite_sports": 1,
    "stats.matches_won": 1
})

Summary

Therefore, to improve your system you must identify what query patterns your system requires, and from there create the relevant indexes. Keep in mind that they are not free. Each index takes up space, and the Mongo scheduler will take longer to decide which index to use for each query the more indexes there are. Therefore, we should have only needed indexes.

And that would be all. In a more advanced post we will surely discuss more advanced concepts such as TTL indexes, the ESR rule to define the order of indexes correctly, or how to manage indexes correctly in the production environment.

23