22
How I improved nested arrays performance in our MongoDB
How it started?
In 2019 I started to work on fanficus.com. Which is Russian speaking fiction writers/readers social network. Spent around a month thinking about how to structure the architecture of the web application. In the beginning, I didn’t know exactly what I am working on. Initially, it seemed a small side project for a few months.
When starting I decided to choose to MEAN full-stack (MongoDB, Angular, ExpressJs, NodeJs). However, it was a dilemma what to choose MySQL or MongoDB. Because previously I had some experience with MySQL and I knew SQL databases had a major market share in web projects. MongoDB was chosen because it is based on javascript objects, therefore it’s natural to use on the current stack.
The largest part of the web app is based on posts. They can be of different types and include a lot of parameters. Such as age rating, genres, tags, etc.
Understanding that the database has to be flexible I decided to avoid nested arrays of objects. Because that was creating a risk of misalignment between collections. For example, there is a post that has genres array objects. Genre contains title and id. But at the same time, there is a genres collection. Where all genres are listed. We use it when creating new posts.
post: {
genres: [id1, id2, id3],
tags: [id1, id2, id3]
}
Now, what happens if we rename a genre? in the genres collection, it is renamed, but in all posts containing the genre, it remained with the old name. So we end up with a post that contains inexisting genre.
Yes, you may say each time we update the genre we can update all the posts containing this genre. Though I wanted to avoid this obsolete load on the database.
The way was chosen is to store in the post only an array of IDs of genres. It seemed the most effective solution. Anyway, it was more optimal than going in an SQL way and having three collections: posts, genres, post-genre.
Problem
The website was released and worked well enough. Even on each request, it was populating all the posts with an age rating, genres, etc. As a result on the frontend, I was receiving the full objects and didn’t need to run additional queries.
However, after a few months number of posts increased. We started to mention slow loading times. After spending some time on the investigation figured out what was the issue. And the problem was in an ****array of IDs. Each post contained up to ten such nested arrays. Meaning each post had to run ten population commands.
PostModel.find({}).populate(‘genres’).populate(‘tags’).populate(‘ageRating’).exec();
It was not the only problem. The way we run search queries on posts also was impacted by the way we store nested IDs. Each time we run a search on the website it was seeking tags titles, then we took IDs and run a posts query.
const tagsFound = await TagModel.find({‘title’: { $in: keywordsRegArr }}).exec();
const tagsIdsArr = tagsFound.map( tag=> tag._id );
PostModel.find({tags:tagsIdsArr}).exec();
In addition to page load performance, we saw our MongoDB Atlas server was charging too much.
How it was solved?
After long arguments with myself, I saw that the most optimal solution will be to store the full objects inside each post. Yes, it was increasing database size. However, it was solving the main issue with performance and querying.
Now posts collection looked this way:
post: {
genres: [{id: 1, title: 'one'}, {id: 2, title: 'two'}],
tags: [{id: 1, title: 'one'}, {id: 2, title: 'two'}]
}
It was meaning now each time we search posts we don’t need to search tags to get their IDs. Now it’s just running **text search **over all posts.
One more important thing that was added is caching. For that one, I used node-cache npm package. Part of the queries is cached on the NodeJs. That way we reduce the load on the database. Some requests are cached for hours, some for minutes.
Result
As already was said, now we were able to run a text search query and avoid multiple populations.
Post objects were retrieved from the posts-collection directly without any manipulations.
As a result query and loading times were reduced at least by four times. And it’s on the current database which is in comparison to social networks is small.
Drawbacks
Now each time we change genres, age ratings, etc, we need to update all posts containing these objects. But these items change rarely, so we can absorb this one.
Then also I had to modify search queries from the client app. Because post-collection contained a nested array of objects instead of an array of IDs
Storage size increased. In the end, the database size didn’t increase significantly. We didn’t even mention it.
Conclusion
The project was released a year ago. At the moment we have 22k users and it’s growing. We have a lot of features implemented, though a lot more are in development. There are many issues we meet on the road. However, overcoming them and conquering the next mountain peaks.
Now the database refactoring is tested on our testing server and gonna be released soon.
22