October 14, 2020

Getting started with MongoDB: what I’ve learned

Before working for Bugsnag, my main database experience was with Microsoft SQL Server and MySQL, both of which are relational databases. The first time I looked at the NoSQL database MongoDB, I struggled to leave the relational database mindset behind. Creating and finding a document was easy, but then I made the mistake of creating a document in one collection and trying to work out how to link it to a document in another collection. Of course, that’s not how NoSQL works, and it took me a while to get my head around it.
Sure, there can be a customerProjectId field on a document in the bugsnag_events collection, and a document in the customer_projects collection whose _id matches it, but MongoDB does not enforce any link between them. If the customer_projects document is deleted, the event document will still remain.
For a short time I found this hard to accept. Surely this would open up a can of worms? I can get this event document, but if I then want more information about its customer project, I’ll get an error because that project no longer exists. Turns out, that’s OK. I should get an error in that situation. It means that the events collection hasn’t been cleaned up to remove events for the now-deleted project.
Within a couple of weeks working at Bugsnag I was happily querying my development database, getting a feel for how we stored our documents, and creating queries that I would then need to use in the code I was creating. It wasn’t long before I got rid of the relational database mindset.
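
To make the dangling reference concrete, here is a minimal sketch in plain JavaScript. The two arrays are in-memory stand-ins for the customer_projects and bugsnag_events collections, the second event is made up for illustration, and findOrphanedEvents is a hypothetical helper of mine rather than anything MongoDB provides.

```javascript
// In-memory stand-ins for the two collections (illustrative data only).
const customerProjects = [
  { _id: "5df7768ed1befa0009fe5acc", projectName: "wills test project" },
];
const bugsnagEvents = [
  { _id: "5f36d24b000001bdbd240000", customerProjectId: "5df7768ed1befa0009fe5acc" },
  { _id: "5f36d24b000001bdbd240001", customerProjectId: "5df7768ed1befa0009fe5acc" },
];

// Hypothetical helper: events whose customerProjectId matches no project.
function findOrphanedEvents(projects, events) {
  const knownIds = new Set(projects.map((project) => project._id));
  return events.filter((event) => !knownIds.has(event.customerProjectId));
}

// "Deleting" the project leaves the events array untouched.
const remainingProjects = customerProjects.filter(
  (project) => project.projectName !== "wills test project"
);

const orphansBefore = findOrphanedEvents(customerProjects, bugsnagEvents);
const orphansAfter = findOrphanedEvents(remainingProjects, bugsnagEvents);
console.log(orphansBefore.length, orphansAfter.length);
```

Nothing stops the delete, so any clean-up of the leftover events has to be done by the application.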

If I wanted to find all of the events for my test project, I knew I first had to query the customer_projects collection to get the ID of that project. Then I could query the bugsnag_events collection using that ID as the customerProjectId. It would look something like this:

db.customer_projects.find({"projectName" : "wills test project"})

Which would return a document like this:

{
  "_id" : ObjectId("5df7768ed1befa0009fe5acc"),
  "projectName" : "wills test project",
  // other fields
}

Then I could query the bugsnag_events collection to get all events for that customerProjectId:

db.bugsnag_events.find({"customerProjectId" : ObjectId("5df7768ed1befa0009fe5acc")})

If I had been doing that in SQL, I would have been able to do it in a single query using a join. It would look like this:

SELECT * FROM bugsnag_events
INNER JOIN customer_projects ON bugsnag_events.customerProjectId = customer_projects._id
WHERE customer_projects.projectName = 'wills test project';

At first I thought that SQL had the better approach, but then I realised that if the customer_projects record didn’t exist, I wouldn’t get any results back. How could I know whether that was because there were no events for the project or because the project itself didn’t exist? Sometimes that doesn’t matter, but sometimes it does, as there is a huge difference between a project with no events and a project that doesn’t exist. For this, I found that the MongoDB approach is probably better: look for the project first and then for its events. If the project doesn’t exist, I know where the problem is.
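
Here is a rough sketch of that two-step lookup in plain JavaScript, with arrays standing in for the two collections. The “empty project” entry and the findEventsForProject helper are my own inventions for illustration; the point is only that the missing project and the empty result come back as different answers.

```javascript
// In-memory stand-ins for the collections; the data is illustrative.
const customerProjects = [
  { _id: "5df7768ed1befa0009fe5acc", projectName: "wills test project" },
  { _id: "5df7768ed1befa0009fe5acd", projectName: "empty project" },
];
const bugsnagEvents = [
  { _id: "5f36d24b000001bdbd240000", customerProjectId: "5df7768ed1befa0009fe5acc" },
];

// Hypothetical helper: look up the project first, then its events,
// so a missing project and an empty result are reported differently.
function findEventsForProject(projectName) {
  const project = customerProjects.find((p) => p.projectName === projectName);
  if (project === undefined) {
    return { status: "project-not-found", events: [] };
  }
  const events = bugsnagEvents.filter((e) => e.customerProjectId === project._id);
  return { status: events.length > 0 ? "ok" : "no-events", events };
}

console.log(findEventsForProject("wills test project").status);
console.log(findEventsForProject("empty project").status);
console.log(findEventsForProject("deleted project").status);
```

A single inner join would return zero rows for both of the last two calls, which is exactly the ambiguity I ran into.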

Document structure

Another concept in MongoDB that was foreign to me was the idea that a document could contain subdocuments. When I heard about this, I suddenly thought that what I had learnt about MongoDB not being a relational database was false. A subdocument must be a different document, which means the first document must relate to that subdocument somehow. But then a fellow engineer explained it to me with a single word: JSON. As a software engineer I am very familiar with JSON. A MongoDB document is a form of JSON: a set of key-value pairs, where a value can be another JSON object or, in the MongoDB world, another document. (Technically, MongoDB uses BSON, which is binary JSON.) A document with a subdocument might look like:

{
  "_id" : ObjectId("5f36d24b000001bdbd240000"),
  "customerProjectId" : ObjectId("5df7768ed1befa0009fe5acc"),
  "notifierUsed" : {
      "name" : "Bugsnag Go",
      "version" : "1.5.3"
  }
}

This is a small part of a bugsnag_events document. It has the _id and customerProjectId fields that I’ve already spoken about, but it also has a notifierUsed field, whose value is itself a document.
This kind of structure is what makes MongoDB so powerful, because you can nest whatever data you need inside your documents. Not only that, but by default MongoDB doesn’t enforce a predefined structure on the documents in a collection. That means I can have documents in my bugsnag_events collection that don’t have the notifierUsed field.
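
Because a subdocument is just a nested object, reading it in code feels like reading any other JSON. The sketch below uses plain JavaScript objects as stand-ins for documents; the second document and its _id are made up to show what happens when the notifierUsed field is missing.

```javascript
// Illustrative documents: the second one has no notifierUsed subdocument.
const events = [
  {
    _id: "5f36d24b000001bdbd240000",
    notifierUsed: { name: "Bugsnag Go", version: "1.5.3" },
  },
  { _id: "5f36d24b000001bdbd240001" },
];

// Optional chaining yields undefined instead of throwing when the
// subdocument is missing.
const notifierNames = events.map((event) => event.notifierUsed?.name);

// Keep only the events reported by a given notifier.
const goEvents = events.filter((event) => event.notifierUsed?.name === "Bugsnag Go");

console.log(notifierNames, goEvents.length);
```

In MongoDB itself, the equivalent filter on an embedded field uses dot notation, for example {"notifierUsed.name" : "Bugsnag Go"}.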

Sharded collections

Sometimes a collection I want to query is sharded. This means that the data I want to find is split across multiple machines.
We have several sharded collections at Bugsnag because of the huge amount of data we store. We process millions of crashes and billions of sessions a day, and storing all of that on a single machine just wouldn’t work. So instead we shard our data across multiple machines using a shard key of projectId and _id.
This means that when querying documents, I can include the projectId in the query and MongoDB will know which shard or shards hold the matching documents, instead of having to ask every shard. This makes a huge difference to performance. I always have to think about this when writing queries in code, or when reviewing other people’s code, to make sure that the shard key is used when querying a sharded collection. It’s definitely a gotcha: it has caught me out before, and someone else spotted it while reviewing my code.
See this excellent Bugsnag blog post on sharding: https://www.bugsnag.com/blog/mongo-shard-key
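
To show why the shard key matters, here is a deliberately simplified toy model in plain JavaScript. It is not how MongoDB routes queries internally: the two shards, their ranges, and the shardsForQuery function are all assumptions of mine, and a real cluster tracks ranges over the whole shard key. It only illustrates the difference between a query that can be sent to one shard and a query that has to be sent to all of them.

```javascript
// Toy model: each shard owns a contiguous range of projectId values.
// The shard names and ranges are invented for this example.
const shards = [
  { name: "shardA", minProjectId: "0", maxProjectId: "8" }, // ids starting 0-7
  { name: "shardB", minProjectId: "8", maxProjectId: "g" }, // ids starting 8-f
];

// Hypothetical router: returns the names of the shards a query must visit.
function shardsForQuery(query) {
  if (typeof query.projectId !== "string") {
    // No shard key in the query: every shard has to be asked.
    return shards.map((shard) => shard.name);
  }
  return shards
    .filter((s) => query.projectId >= s.minProjectId && query.projectId < s.maxProjectId)
    .map((s) => s.name);
}

const targeted = shardsForQuery({ projectId: "5df7768ed1befa0009fe5acc" });
const scattered = shardsForQuery({ notifierName: "Bugsnag Go" });
console.log(targeted, scattered);
```

The first query touches a single shard, while the second fans out to every shard and waits for all of them.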

The power of the ObjectID

ID fields in MongoDB tend to be of type ObjectID. This is a very powerful type, and I’ve used it in the ID fields of my examples so far. At first glance it looks like a randomly generated string of characters, but there’s more to it: it contains a very useful piece of data, a timestamp.
According to the MongoDB documentation, an ObjectID is made up of:

  • a 4-byte timestamp value, representing the ObjectID’s creation, measured in seconds since the Unix epoch
  • a 5-byte random value
  • a 3-byte incrementing counter, initialized to a random value

This means that when you create a document, you don’t necessarily need a “createdAt” field, because the creation time (to the nearest second) is right there in the ID. It also lets you query a collection for documents created on or after a given date by providing an ObjectID.
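
As a quick illustration of the timestamp hiding in the ID, this plain JavaScript sketch pulls the creation time out of an ObjectID’s hex string. The objectIdToDate function is my own helper for this post, not part of MongoDB, and it relies only on the byte layout listed above.

```javascript
// Extract the creation time from an ObjectId's 24-character hex string.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  // The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// The project _id used earlier in this post.
const createdAt = objectIdToDate("5df7768ed1befa0009fe5acc");
console.log(createdAt.toISOString());
```

In the MongoDB shell you don’t need a helper like this, because an ObjectId value has a getTimestamp() method that returns the same information.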
To create a new ObjectID from a given timestamp, there is a built-in function as follows:

 // returns ObjectId("5f66f2260000000000000000")
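
The same idea can be sketched in plain JavaScript, independent of the shell’s built-in helper. The objectIdHexFromDate function is a hypothetical name of mine; it writes the timestamp into the first 4 bytes and fills the remaining 8 bytes with zeros, which gives the smallest possible ID for that second.

```javascript
// Build the smallest ObjectId hex string for a given moment:
// 8 hex characters of timestamp followed by 16 zeros.
function objectIdHexFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  return seconds.toString(16).padStart(8, "0") + "0".repeat(16);
}

const boundary = objectIdHexFromDate(new Date("2020-09-20T06:09:42Z"));
console.log(boundary); // 5f66f2260000000000000000
```

Zero-filling the random and counter bytes is what makes the result usable as a lower bound: every real ID created in that second or later compares as greater than or equal to it.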

So if I wanted to query events that were newer than this, I could write this command:

// collection name assumed from the earlier examples
db.bugsnag_events.find({
    "_id" : {
        "$gte" : ObjectId("5f66f2260000000000000000")
    }
})

This would return all events whose _id is greater than or equal to that ObjectID, which means all events created at or after the timestamp I wanted. I like this feature a lot.

Configure your MongoDB shell

The MongoDB shell is a very powerful tool. You can do everything you want from the shell. One trick I learnt from a fellow engineer while writing this blog post is that you can configure it with your own defaults, such as function aliases and display settings.
To do this, you create a file called .mongorc.js in your user home directory.

In this file you can then add things such as being able to display the current database name in the prompt. I find this very useful as it means I always know which database I’m using. This can be done by adding in the following function:

// Override the prompt to add in the database name
prompt = function () {
   return db + '> ';
};

Another is to enable .pretty() by default. When you run a query in the MongoDB shell, it returns each result as one long line, which is very hard to read. If you add .pretty() to the end of the command, it lays the results out in a nicer way that looks more like JSON. To do this by default, you can add the following line to your .mongorc.js file:

 // Pretty print by default
DBQuery.prototype._prettyShell = true

The last example is being able to quickly run a query using an alias. In my local Bugsnag development environment I have a lot of projects. Most of them are the different services we have in the Bugsnag system, but I also have lots of test projects, and I can never remember their names when looking for them in MongoDB. So I have an alias set up that finds all of the projects for my Bugsnag account and returns only the _id, name, and apiKey of each project. It looks like this:

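// get all projects in Bugsnag account
projects = function () {
  return db.projects.find(
    // filter: non-deleted projects in my account
    {
      deleted_at: null,
      account_id: ObjectId("5f36d227d1befa0007935e49")
    },
    // projection: _id is included by default
    { name: 1, apiKey: 1 }
  );
};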

I wish I had found out about this sooner, as it’s a really powerful trick. I can see myself adding lots of alias functions for the commands I find myself running often.

NoSQL isn’t relational, but it turns out that’s OK

When I first had a look at MongoDB, it was quite daunting. Coming from the world of relational databases, suddenly being able to put anything I wanted inside a document was so foreign to me. But as with most things I learn, I jumped in at the deep end and just started playing around. Thankfully, when I started at Bugsnag, our dev environment contained a lot of pre-populated data that I could play around with, and since it was MY development data, if I broke it, I could wipe it clean and start fresh (I’ve only had to do this twice, honest).

However, the things that have impressed me the most about MongoDB are the tools and the “customization” that you can do. I really like being able to load a script of my own functions into the MongoDB shell. I have so many ideas for scripts that I could probably lose a lot of time creating them. Being able to do everything from a terminal really pleases me as a techy and software engineer. If I can use a terminal and write a script to automate something so that I don’t have to use a clunky GUI, I will!

Hopefully this has been of interest to you, and if you’ve not used MongoDB before, it might inspire you to give it a go. Seriously, try it out. I really wish I had tried it sooner. I find it a really interesting subject, and when I review code that contains MongoDB queries, I always read up on what those queries are and how they work. That is mainly for my own learning, but sometimes I have learnt a “better” way of doing something that I can pass on.
