Tech

How to do Query Optimization in MongoDB

Huge amounts of data are exchanged every day, and virtually everything has made its way on to the internet. So of course we need databases to save the data and retrieve it when needed.

There are a lot of articles on improving the performance of SQL databases, optimization, and their limitations. However, with No-SQL there is only a handful resources, which is why we end up making poor choices either selecting or using No-SQL databases.

This article will focus on the performance of MongoDB (No-SQL), starting from basic queries to complex ones and explaining them along the way using MongoDB .explain() method. We will be figuring out how MongoDB executes a query, the amount of time it takes, which index is being used and why.

Tools

Before we start here are the main tools I’ll be using:

The article covers the following chapters (click to navigate):

  1. Setup
  2. Why You Need Indexing
  3. How to Do Index Selection
  4. Joins
  5. Denormalizing
  6. Lookup

1. Setup

Let’s walk you through the database we’ll be using: 'developer' is the collection name and the document consists of first_name, last_name, email, date of birth (dob), age, language, country, and experience.

I used mgodatagen to create dummy data and added 5 million records just to be as practical as I can be in terms of the number of records in a real-life production database.

2. Why You Need Indexing

Let's say I want to find all Ukrainian developers. This is a simple query and I believe if you’re reading this you must have a good idea of the 'find' query in MongoDB. This is a screenshot of MongoDB Compass, which will help us dig deep into the explain method and analyze it.

Now if you take a closer look, you'll see it took 1128 ms by MongoDB to complete this query. Keep in mind I am not using any kind of index and it can be seen in the screenshot yellow triangle.

In the details section, you’ll see a stage variable named COLLSCAN which tells us that the database searched every collection in the database to get results. 'nReturned' is the total number of records returned by the database.

In MongoDB query executions divide their work into small units.

The works variable tells us the number of work units performed by the query execution stage. A 'work unit' might consist of examining a single index key, fetching a single document from the collection, applying a projection to a single document, or doing a piece of internal bookkeeping. In our case, it is 5m and 2.

The next important value to look for in detail is docsExamined which is the number of documents scanned during the query execution stage. In this case, it is 5m; basically, all 5m records have been scanned during execution.

Now I don’t have to tell you that it is dumb - searching through the whole database every time we have to fetch something from the database. But it gets even more interesting. What if I wanted to get all the developers that are from Cyprus and the programming language they use is C++? Here we go! 👇

Look at the 'docsExamined' same 5m and so are the other variables we talked about above. Most importantly, the execution time increased from 1128ms to 1157ms. Now you can imagine if I add other filters to the query, the time will increase regardless of the type of filter we add.

This is the reason we need indexing in our database. Database indexes are a type of data structures that improve the speed of retrieving data (of course, at the expense of storage) because it has to be saved somewhere. They provide the ability to quickly locate desired results without actually having to search through every record in the database.

MongoDB uses a B-tree data structure for creating its indexes. It stores a small portion of collection data in an easy-to-traverse form. The index stores the value of the field specified while creating the index or a set of fields (compound index) ordered by the value of the field.

The ordering of index entries supports efficient equality matches and range-based queries.

MongoDB offers different types of indexes for various lookups:

  • Single field index (single field);
  • Compound field index (multiple fields in the document);
  • Multi-key index (index to sort data in arrays);
  • Geospatial index (indexing geospatial coordinates data);
  • Text index (string content in the collection);
  • Hashed indexes (index the hash values of specific fields to support hash-based sharding).

3. How to Do Index Selection

For us to get the best of indexes, there are some things to keep an eye on. If there is an index that can satisfy the query by using index-scans only, that's when we call this index Covered, as we no longer need any additional work for the query. These queries are called Covered Queries.

For an index to be covered, there are two conditions:

  1. Every field in the query is a part of the index.
  2. Every field to be fetched as a result of the query is a part of the index.

Let’s take an example in our dataset. Remember we searched for all the developers in Cyprus that are using C++? Now we need to create an index on the country field, do the same query and see the effect on query time, work units, and docsExamined.

Look at that! By using a single-field index, we have reduced the time 1157ms to 31ms (16+15). That’s almost 38 times faster. And now if we check work units, we see it is reduced to 20,456 from 5m and 2. Then docsExamined dropped from 5m to 20455. And most importantly, the stage is now IXSCAN which from its name suggests that the database used an index to perform this query.

We have drastically decreased the query time but the next question is can we do better than that? Can we decrease the execution time, docsExamined number? Yes, we potentially can decrease time even more.

We talked about covering indexes well for our query when we create indexes that have all the fields in the query (country and language) and the return from the index also includes the query fields. Let's now create an index and run a query:

We did it again, now the query execution time is down to ~0ms. We started from 1157ms to ~0ms. The docsExamined number is down to 27 which is the actual number of results returned.

Small tweaks on how you select and use indexes can majorly affect the time and efficiency of query execution.

Indexes do speed up the query execution dramatically but there is also a cost behind it: they use memory and adding too many will cause the working set not to fit in memory, so always make sure you're indexing (but not too much).

4. Joins

The most commonly-known drawback of No-SQL databases like MongoDB is the lack of JOINS on database-level. That said, No-SQL databases do not have schemas (which makes it easier to write on databases), while the key-pair makes them unique and popular against typical SQL databases.

To overcome the lack of database-level JOINS, you can do application-level joins but that will be costly because then you’ll be traversing all the data in the database and above that, you’ll first have to get all data to the application, apply join then perform any action/manipulation on the data.

5. Denormalizing


Another way of overcoming the need to join is denormalization of collections in databases. Imagine if we had one more collection called 'job' which would have all the information about a developer's job like salary, company, job title, and working hours, you get the idea.

{
        "_id": {
            "$oid": "60029896ec15b4c9f3e91871"
        },
    "dob": {
        "$date": "0001-01-01T00:00:00.000Z"
        },
    "email": "camdenwisoky@rice.io",
    "language": "C++",
    "country": "Cyprus",
    "experience": 18,
    "age": 25,
    "first_name": "Carlie",
    "last_name": "Mills"
}

{
    "_id""{
        "$oid": "60029896ec15b4c9f3e91871"
    },
    "developer_id": "60029896ec15b4c9f3e7d1ca",
    "salary": 1200,
    "company": "Microsoft",
    "workinghours": "0800-0500"
}

Both collections will look like this. To denormalize what we can do is trading off writing performance with reading performance.

Instead of joining at the application-level, we can write job information in the developer collection. That will eventually remove the dependency of using application-level or the need for joins at all.

However, that comes with storage cost, meaning our writing performance will also be affected. Same with updating records. In turn, we will not need joins and our read performance is bound to increase.

{
    "_id": {
        "$oid": "60029896ec15b4c9f3e91871"
    },
    "dob": {
        "$date": "0001-01-01T00:00:00.000Z"
    },
    "email": "camdenwisoky@rice.io",
    "language": "C++",
    "country": "Cyprus",
    "experience": 18,
    "age": 25,
    "first_name": "Carlie",
    "last_name": "Mills"
    "job": {
        "salary": 1200,
        "company": "Microsoft",
        "workinghours": "0800-0500"
    }
}

6. Lookup

MongoDB introduced the $lookup operator in version 3.2 which comes in handy while dealing with joins between databases.

It is an aggregation pipeline and performs left-outer join on an unsharded collection in the same database (sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations).

{
    $lookup:
        {
         from: <collection to join>,
         localField: <field from the input documents>,
         foreignField: <field from the documents of the "from" collection>,
         as: <output array field>
        }
}

This is the basic syntax for lookup operators. If we take the example of developer and job collection, then by using a lookup operator we can query the database like this:

{
    $lookup:
        {
         from: "developer",
         localField: "_id",
         foreignField: "developer_id",
         as: "job"
        }
}

It takes away some pain of performing joins in MongoDB but it is far away from Joins in SQL databases like MySql. And it almost always requires an index to support the join.

Lookup operator only allows us to do a left outer join, meaning we can't choose any other type of join like right outer join, inner join, etc. MongoDB at the end of the day is a simple documentDB and is not optimized for relational data at scale.

Last but not least...

Takeaway time!

What is most important that you take from this article, is this: small tweaks on how you select and use indexes can majorly affect the time and efficiency of query execution in MongoDB.  This will help you avoid making poor choices using a No-SQL database like Mongo.

Hopefully, this article helped you understand how to obtain better performance with MongoDB queries. If you still have questions, we warmly welcome them in the comments section ☺️. Oh and if you learned something in this article, maybe you want to learn some more by subscribing to our newsletter (check below).



Have an app idea? It’s in good hands with us.

Contact us
Contact us