Guide to MongoDB for startups

Relational databases have traditionally been used to store and query structured data. Used correctly, they give us atomicity, consistency, isolation and durability (ACID). But over the last few years NoSQL databases have gained quite a lot of traction.

With the evolution of the Internet and smartphones, data has been growing exponentially. Search engines and e-commerce stores need to return accurate results over all this data. With these growing needs it became increasingly difficult to define a fixed structure for the data, and the need for a solution that handles unstructured data grew even more.

Around 2007, NoSQL databases were already in use at big companies processing huge amounts of data: Google had Bigtable, and Amazon had built Dynamo. The publication of the paper "Dynamo: Amazon's Highly Available Key-value Store" started a new wave of open source NoSQL databases. Hypertable, based on Google's Bigtable, was released in 2008; Riak, an open source implementation of the ideas in Amazon's Dynamo paper, was launched in 2009; and MongoDB soon followed.

What ACID is to relational databases, CAP is to the NoSQL world. In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

So most NoSQL databases traded consistency for availability and partition tolerance, a fundamental shift from the relational world, where you had to design your schema carefully so that there were no inconsistencies in the data (1NF, 2NF, 3NF, BCNF).

You also get schema flexibility: it is not required to have one fixed schema for similar documents; each document can have its own set of fields and things still work fine.

Out of all the NoSQL databases, MongoDB gained a lot of traction and adoption, backed by good support from the community and from 10gen, the company that released it.

In this blog post I have tried to cover some important concepts MongoDB uses, which will help you evaluate the database.


MongoDB stores data in BSON, a binary extension of the widely popular JSON format that adds support for types like int, long, floating point, date and more. Note that BSON is not necessarily smaller than raw JSON on disk; the type and length information it adds can actually make documents slightly bigger. Its real advantage is that it is efficient to encode, decode and traverse.
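
A quick way to see why plain JSON is not enough is to try serializing a date with it. This is a small Python sketch (using only the standard json module, no MongoDB involved); the field names are made up for illustration:

```python
import json
from datetime import datetime

doc = {"name": "something", "created_at": datetime(2013, 5, 1)}

# Plain JSON has no date type, so the standard encoder rejects it...
try:
    json.dumps(doc)
except TypeError as e:
    print("plain JSON:", e)

# ...and the usual workaround is to flatten the date into a string,
# losing the type information that BSON keeps natively.
doc["created_at"] = doc["created_at"].isoformat()
print(json.dumps(doc))
```

BSON sidesteps this by carrying a native date type (and distinct int/long/double types) alongside the value.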

Collections and Documents

MongoDB has the concepts of collections and documents. A collection, as the name suggests, is a group of logically similar documents; they need not have the same schema or structure. In theory you could put documents with completely different schemas in one collection, but in practice you would not. A collection makes it easy to group similar models that should share indexes, capping and other properties.


MongoDB supports some relationships between documents. A document can either "reference / link" another document, or it can completely "embed" the other document. Both strategies have their advantages and problems. One of the biggest things to notice is that MongoDB does not have joins, and I don't think they will be introduced in the near future, because that goes against the basic NoSQL philosophy of denormalized data.

References are as simple as storing the _id of one document in another.

    original_id = ObjectId()

    db.parents.insert({
        "_id": original_id,
        "name": "something",
        "age": 45
    })

    db.children.insert({
        "name": "some child",
        "parent_id": original_id
    })
Doing this gives you a logical segregation of documents, and you can always find a child and then fetch its parent with a second query on the stored parent_id, e.g. db.parents.findOne({ "_id": db.children.findOne({ "name": "some child" }).parent_id }).

On the other hand, embedding a document means that instead of storing the id we store the complete document inline.

    db.children.insert({
        "name": "some child",
        "parent": {
            "name": "something",
            "age": 45
        }
    })

In MongoDB there is atomicity at the document level: a single write operation can change only one document atomically. Your choice between a reference and an embedded document depends purely on the use case.

For example, suppose you want to find all parents whose age is more than 45. With the embedded design you have to go through each and every child and look inside its embedded parent. That might be fine with hundreds of children, but with a million children it will take a lot more time.

Also, since you are only interested in the embedded object, you would still get back the full child document, because embedded documents don't live outside the document embedding them.

This might not seem like a big problem when the document containing the embedded one is small, but consider a big document embedding a small sub-document: the data transferred over the wire increases significantly.

Also, if I know that in the future I will have to run analytics or calculations on the embedded documents alone, I would not embed them, since that makes queries slow and complicates the application layer, which has to fine-tune queries to perform well.

On the other hand, if I care more about atomicity than anything else, I would embed them, since embedding guarantees that the whole document is written atomically.
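
The query-cost trade-off above can be sketched in a few lines of Python. This is a toy in-memory model, not real MongoDB, and the collection and field names are made up to mirror the earlier example:

```python
# Referenced design: parents live in their own "collection" and can be
# queried directly. Embedded design: parent data only exists inside each
# child, so a query over parents must scan every child.

parents = [
    {"_id": 1, "name": "something", "age": 45},
    {"_id": 2, "name": "other", "age": 60},
]
children_emb = [
    {"name": "some child", "parent": {"name": "something", "age": 45}},
    {"name": "another child", "parent": {"name": "other", "age": 60}},
]

# Referenced design: one scan over the (small) parents collection.
over_45_ref = [p["name"] for p in parents if p["age"] > 45]

# Embedded design: scan every child and deduplicate the embedded parents.
seen = set()
over_45_emb = []
for c in children_emb:
    p = c["parent"]
    if p["age"] > 45 and p["name"] not in seen:
        seen.add(p["name"])
        over_45_emb.append(p["name"])

print(over_45_ref)  # ['other']
print(over_45_emb)  # ['other']
```

Both designs give the same answer, but the embedded scan grows with the number of children rather than the number of parents.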

MongoDB also imposes a restriction on document size, and an embedded document counts as part of the enclosing document. Currently the cap is 16MB per document, which is fairly good for the majority of cases, but if you know that embedding will push a document past 16MB, you have little choice but to split it up using references.

If you want to store data larger than 16MB, MongoDB provides the GridFS API, a file-system-like abstraction on top of MongoDB for exactly this purpose.
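
The core idea behind GridFS is simple: split a large blob into fixed-size chunks, store each chunk as its own document keyed by the file id and chunk number, and reassemble them in order on read. Here is a minimal Python sketch of that idea (not the real GridFS driver API; the 4KB chunk size is just to keep the demo small, real GridFS chunks are on the order of 255KB):

```python
CHUNK_SIZE = 4 * 1024  # illustrative; real GridFS uses much larger chunks

def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Turn one large byte string into chunk 'documents'."""
    return [
        {"files_id": file_id, "n": i, "data": data[off:off + chunk_size]}
        for i, off in enumerate(range(0, len(data), chunk_size))
    ]

def reassemble(chunks):
    """Concatenate the chunks back together in order."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

blob = bytes(range(256)) * 100          # ~25KB of sample data
chunks = split_into_chunks("file-1", blob)
assert reassemble(chunks) == blob
print(len(chunks), "chunks")            # 7 chunks of up to 4KB each
```

Because each chunk is an ordinary document well under the 16MB cap, the total file size is effectively unlimited.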

Sharding and Replication

One of the biggest advantages claimed by NoSQL databases is horizontal scalability: the ability to handle more load by adding more machines. Since computing power these days is cheap compared to the effort required to fine-tune an app or to add more CPU and memory to an existing machine, this has become the choice for most developers and companies.

MongoDB enables horizontal scaling with what it calls "auto-sharding". Sharding distributes data across physical machines which can serve it in parallel. Sharding frees you from the RAM and I/O bottlenecks you hit on a single server, as there is only so much computing power you can put in one machine.

It reduces the number of queries each server handles by distributing them across the shards, which lets the database handle more queries with proportionately less load on each machine. Each shard processes fewer operations as the cluster grows; as a result, sharded clusters can increase capacity and throughput horizontally.

A typical mongodb cluster will have the following components

  • Shard servers
  • Config servers
  • Query routers (mongos)

Mongos acts as the router, distributing queries to the appropriate shards and returning the responses. Application code connects only to a mongos instance, which takes care of the rest. It shields the application layer from any knowledge of sharding; from the application's point of view nothing changes whether there is one server or 100 shards.
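
The routing idea can be sketched in Python. This is a toy model, not MongoDB's actual chunking and balancing (which splits shard-key ranges into chunks and migrates them between shards); here we simply hash a made-up shard key to pick a shard:

```python
import hashlib

NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}

def shard_for(key):
    # Stable hash, so the same shard key always routes to the same shard.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The "mongos" role: route each document to its shard on insert.
for user_id in range(10):
    doc = {"user_id": user_id, "name": "user-%d" % user_id}
    shards[shard_for(user_id)].append(doc)

# A query that includes the shard key touches exactly one shard.
target = shard_for(7)
print([d["name"] for d in shards[target] if d["user_id"] == 7])
```

The application only ever calls shard_for (the router); it never needs to know how many shards exist or which one holds a given document.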

Config servers store the metadata for the shards. In production you want at least 3 config servers; they use the same replication concept as a shard storing any other data.

With only 1 config server you have a single point of failure for the whole application. With 3 config servers, one acts as the master and the other two hold replicas; the data is copied to them asynchronously from the master. Now even if two config servers fail, the application keeps working.

A shard replica set stores the sharded collection's data; in production every shard should contain at least 3 servers for reliability and fault tolerance. In every shard there is one primary machine which serves the queries coming from mongos; the other two machines are secondaries and hold copies of the primary's data, replicated asynchronously from it. When the primary fails or goes down, a re-election happens and a new primary is chosen from the remaining machines in the shard. This keeps the shard serving requests from mongos even if 2 machines fail, as the last machine will still have almost up-to-date data.


The oplog, or operation log, is critical to understanding how MongoDB does replication. Whenever there is a modification on a MongoDB instance, the database first makes the change on the primary and then writes the operation into a separate "capped collection" called the oplog. The secondaries then read this oplog asynchronously and apply the operations to their own copies of the data.
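
That flow can be sketched in a few lines of Python. This is a heavily simplified model (the real oplog is an idempotent, timestamped capped collection, and secondaries tail it continuously); the capped collection is modelled here with a bounded deque, and replication runs on demand instead of asynchronously:

```python
from collections import deque

oplog = deque(maxlen=1000)   # "capped collection": oldest entries fall off
primary = {}                 # _id -> document
secondary = {}

def write(doc):
    """Apply the change on the primary first, then record the operation."""
    primary[doc["_id"]] = doc
    oplog.append(("insert", doc))

def replicate():
    """Secondary replays the oplog in order (here: on demand)."""
    for op, doc in oplog:
        if op == "insert":
            secondary[doc["_id"]] = doc

write({"_id": 1, "name": "something", "age": 45})
write({"_id": 2, "name": "some child"})
replicate()
print(secondary == primary)  # True
```

The deque's maxlen mirrors the key property of a capped collection: the oplog has a fixed size, so a secondary that falls too far behind can find that the operations it needs have already been overwritten.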

Another very interesting property to keep an eye on is replication lag: the delay in copying data from the primary to a secondary. The bigger the lag, the more out of sync the data in the secondaries becomes; and if you are reading from secondaries, that staleness shows up directly in your reads. Replicating very aggressively, on the other hand, adds load to the machines. So you need to watch and tune replication carefully based on your use case.

MongoDB Concurrency

MongoDB uses a readers-writer locking mechanism: it gives concurrent access to reads but exclusive access to write operations. It can handle many concurrent read operations, but when there is a write operation it blocks reads until the write completes.
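
To make the readers-writer idea concrete, here is a minimal Python sketch (a deliberately simple lock with no writer preference, so writers can starve; this is the concept, not MongoDB's actual lock implementation):

```python
import threading

class ReadersWriterLock:
    """Any number of readers may hold the lock together; a writer needs
    exclusive access and waits until all readers have left."""

    def __init__(self):
        self._readers = 0
        self._cond = threading.Condition()

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake any waiting writer

    def acquire_write(self):
        self._cond.acquire()              # blocks new readers too
        while self._readers > 0:
            self._cond.wait()             # wait for active readers to leave

    def release_write(self):
        self._cond.release()

lock = ReadersWriterLock()
lock.acquire_read()
lock.acquire_read()    # a second concurrent reader is fine
lock.release_read()
lock.release_read()
lock.acquire_write()   # exclusive: would block while readers are active
lock.release_write()
print("ok")
```

While a writer holds the condition's underlying lock, acquire_read blocks as well, which is exactly the "writes block reads" behaviour described above.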

Prior to version 2.2, MongoDB had an instance-level lock: whenever there was a write operation it locked the entire mongod instance, so even a read queued for a different database had to wait until the write completed.

This changed in 2.2, where a write operation locks only the database it touches instead of the whole instance. Beyond that, the way to scale writes is to add more shards and route each write to a separate MongoDB shard instance.

For write-intensive applications this is by far the most important point to take into consideration when choosing a database, as the locking can get in the way of scaling.

