Monday, March 16, 2015

It's time to take another look at MongoDB

MongoDB is one of the most popular database systems today and certainly the most widely known document store. Since the latest version (3.0) was released this month, I've decided to write a post on what I consider its strengths and weaknesses, along with some tips for writing Mongo client applications.

The bad news

 

Despite the hype, a strong wave of criticism has emerged in recent years, pointing out several important flaws:

Unlike most database systems, durability is configurable in MongoDB and, more importantly, the default setting does not guarantee it. By durability I mean the basic guarantee that if a write to the database succeeds, the data has been stored, and if it fails because of some problem, the client is duly notified (we are not talking here about replica-based failover during a crash or eventual consistency between replicas). The universal mechanism for this is a write-ahead log that keeps track of the operations pending to be processed; a write is only successful once it has been recorded in the log. In MongoDB this mechanism is called the journal, and by default clients do not wait for the journal to be written before returning from a write operation. The operation stays in server memory until a thread flushes it to the journal asynchronously, so there is a time window in which a server crash causes undetected data loss. Fortunately, you can configure the client to wait for the journal write before reporting success, but this is not the default behavior and it carries a performance penalty. Most of the initial MongoDB benchmarks were run without journaling, making the comparison with other systems a bit unfair.
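To make the write-ahead idea concrete, here is a minimal sketch in plain Java. It is not MongoDB code and it uses no MongoDB API: the TinyJournal class and its durableWrite method are hypothetical names, and the only point is that the call returns after the record has been forced to disk, which is what waiting for the journal buys you.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of a write-ahead log; not how mongod is implemented.
final class TinyJournal implements AutoCloseable {
    private final FileChannel channel;

    TinyJournal(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Returns only after the record has been appended and forced to storage.
    void durableWrite(String record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true); // flush data and metadata before acknowledging the write
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("journal", ".log");
        try (TinyJournal journal = new TinyJournal(file)) {
            journal.durableWrite("insert {\"_id\": 1}");
        }
        System.out.println(Files.readAllLines(file));
        Files.delete(file);
    }
}
```

Acknowledging the write before the force call would be faster, and that is roughly the trade-off behind the default described above: the client gets its answer sooner, but a crash in between loses the record silently.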

Initially, MongoDB's write lock was global, meaning that the whole mongod server instance was blocked for each write, effectively serializing all insertions. In later releases the lock moved to database level, which in most cases forced you to model the data with a single collection per database in order to parallelize insertions.

Scalability. Mongo is not the ideal repository for web-scale requirements, certainly not when compared with systems designed to store distributed data sets in the petabyte range, like HBase or Cassandra. Despite the marketing and the brand name (from "humongous"), it does not seem to have been designed with cluster distribution in mind from the start: when using sharding and replication, you need a separate replica set for each shard. For example, a total of 9 nodes is needed to support 3 shards with a replication factor of 2 (a primary plus two replicas per shard). The configuration is also complicated: it needs separate routing processes (mongos), themselves deployed redundantly for high availability, plus extra processes that store and serve the cluster metadata (config servers), deployed as a group of 3 nodes. On top of that, database functionality is limited when sharding is used.
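The node arithmetic is easy to check with a few lines of Java. This is only a sketch: ClusterSize and its methods are hypothetical names, the replication factor is read as "replicas in addition to the primary", and the two mongos routers in the example are an assumption made for illustration.

```java
// Hypothetical helper that counts the processes in a sharded deployment.
final class ClusterSize {
    // Each shard is its own replica set: one primary plus replicasPerShard secondaries.
    static int dataNodes(int shards, int replicasPerShard) {
        return shards * (1 + replicasPerShard);
    }

    // Adds the config servers and the mongos routers to the data nodes.
    static int totalProcesses(int shards, int replicasPerShard, int configServers, int routers) {
        return dataNodes(shards, replicasPerShard) + configServers + routers;
    }

    public static void main(String[] args) {
        System.out.println(dataNodes(3, 2));            // prints 9
        System.out.println(totalProcesses(3, 2, 3, 2)); // prints 14 (2 routers assumed)
    }
}
```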

Storage. MongoDB is schema-free but not schema-less: storing documents in BSON format means that each document carries its own schema (field names, structure, hierarchy). This is a significant storage overhead compared with schema-bound solutions.
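A rough way to see that overhead is to count the bytes by hand. The sketch below follows the BSON layout for a flat document of int32 fields (a 4-byte length, then a type byte, a null-terminated field name and a 4-byte value per field, then a trailing zero byte). The BsonOverhead class and the sample field names are hypothetical, and no BSON library is involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical size estimator for a flat BSON document holding only int32 values.
final class BsonOverhead {
    // Every element stores its field name as a UTF-8 C string: the name plus a 0x00 byte.
    static int nameBytes(List<String> fieldNames) {
        int total = 0;
        for (String name : fieldNames) {
            total += name.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return total;
    }

    static int int32DocumentSize(List<String> fieldNames) {
        int header = 4;       // int32 holding the total document length
        int terminator = 1;   // trailing 0x00
        int perField = 1 + 4; // type byte + int32 value
        return header + nameBytes(fieldNames) + perField * fieldNames.size() + terminator;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("temperature", "humidity", "pressure");
        System.out.println(nameBytes(fields));         // prints 30
        System.out.println(int32DocumentSize(fields)); // prints 50
    }
}
```

In this example the three values take 12 bytes, while the field names alone take 30 of the 50 bytes, and that cost is paid again in every document of the collection.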

The good news 

 

Obviously, a product this successful has several strong points:

The document format, BSON, maps directly from JSON. This is a huge advantage for front-end development based on JavaScript: Ajax messages can be stored directly in the repository. There is no relational mapping and no JPA needed.

Schema flexibility means that schema changes such as adding or removing fields are not a problem. One of the most painful scenarios we can find when using a relational database is gone.

This is possible in part because there are no relations between collections, and no multi-collection transactions or joins.

New developments


This month MongoDB published the new 3.0 release, which includes a new storage engine, WiredTiger, that goes some way toward fixing two of the most criticized points:

WiredTiger supports compression (two codecs, snappy and zlib). This is a huge improvement in storage and disk I/O. Mongo claims that data storage can be reduced by 70%. In our particular case I measured a size reduction of more than 80% using Snappy. (Snappy aims for fast compression with a reasonable ratio, whereas zlib provides maximum compression but is slower.) The tests also showed improvements in average write speed: the reduced disk I/O compensated for the CPU time spent on compression. My case is especially favorable, since the documents are very big and highly redundant.
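To get a feel for why redundant documents compress so well, here is a small experiment using the JDK's own zlib implementation (java.util.zip.Deflater). It only illustrates the principle: it is not WiredTiger's block compression, the JDK ships no Snappy codec, and the RedundancyDemo class and the sample documents are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical experiment: how much does zlib shrink documents that repeat their field names?
final class RedundancyDemo {
    // Builds n JSON-like documents that share field names and most values.
    static byte[] sampleDocuments(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("{\"sensorName\":\"boiler-room\",\"measurementUnit\":\"celsius\",\"value\":")
              .append(i % 50)
              .append("}");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns the number of bytes produced by zlib at its default compression level.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] raw = sampleDocuments(1000);
        System.out.printf("raw=%d bytes, deflated=%d bytes%n", raw.length, deflatedSize(raw));
    }
}
```

The exact ratio depends entirely on the data, so treat the printed numbers as an illustration rather than a benchmark.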

WiredTiger locks writes at document level, providing the highest throughput. Even if you use the previous mmap-based engine (MMAPv1), locking is now done at collection level.

Building our stack

 

MongoDB is a good solution when you have to store unstructured data. It also gives you horizontal scaling and high availability out of the box. Compression was a critical improvement, since data redundancy is inherent to the model: there are no relations or joins, so all the data needed has to be stored in each collection, duplicated in a denormalized fashion if necessary; and the schema is free, meaning it is stored with each document. The reduced storage needs mean that sharding can be minimized or avoided altogether, and that the scale-out limits are raised.

There is no excuse for not using Object Document Mapping

When using MongoDB from Java, the first thing we notice is the impedance mismatch between the Java object and the BSON document. This can be addressed with an object-document mapper. With ORM the mismatch with the relational model can be very high, and there are advocates both of using SQL directly and of using a mapping framework like JPA. Here there is no such debate: the best approach is to use a document mapper, since each document is stored in a single collection, not normalized, with no relations, foreign keys, or multiple tables involved.

Use Jackson

Usually the only variable measured when choosing a JSON mapper is performance. It is a major factor for sure, but the feature set is equally important. Jackson is near the top in serialization speed and is, hands down, the most flexible and configurable, with the richest feature set. With Jackson you don't even need to annotate your classes directly: you can do it externally using mix-ins.

Leverage the existing frameworks

There are two frameworks that provide data binding using Jackson as the JSON mapper: mongojack and jongo. Both serialize directly to BSON, removing an intermediate step. I opted for mongojack because it maps a collection to a generic type once, at creation, whereas jongo needs to receive the destination class on each query call (which can be useful if you need to change it between queries). Either once at creation or on each query call, the class must be provided because of Java type erasure.

Keep it simple, your data is never so unstructured

Use the class definition as the schema of the collection (that's the mongojack approach), mapping one type to one collection, and leverage inheritance so that instances of subclasses can be stored in the same collection to represent evolving schemas. Jackson can store the class type as metadata in the document to support polymorphic deserialization, and unknown fields can be configured to be ignored. I have tested this with mongojack and it works perfectly.

Use ObjectId as the key for all collections. Think of it as a sequence or auto-increment synthetic id that is guaranteed to be unique in the cluster and contains the document creation timestamp.
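The embedded timestamp is easy to recover: the first 4 bytes of an ObjectId (the first 8 hex characters) are the creation time in seconds since the Unix epoch. The sketch below decodes it with plain Java and no driver. ObjectIdTime is a hypothetical helper name, and the sample id is just an example value.

```java
import java.time.Instant;

// Hypothetical helper: reads the creation time stored in the first 4 bytes of an ObjectId.
final class ObjectIdTime {
    static long creationEpochSeconds(String hexObjectId) {
        if (hexObjectId.length() != 24) {
            throw new IllegalArgumentException("an ObjectId has 24 hex characters");
        }
        // The leading 8 hex characters are a big-endian count of seconds since the epoch.
        return Long.parseLong(hexObjectId.substring(0, 8), 16);
    }

    static Instant creationTime(String hexObjectId) {
        return Instant.ofEpochSecond(creationEpochSeconds(hexObjectId));
    }

    public static void main(String[] args) {
        System.out.println(creationTime("507f1f77bcf86cd799439011")); // prints 2012-10-17T21:13:27Z
    }
}
```

This also means that sorting by the key roughly sorts by creation time, to one-second resolution.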

Write in batches

Although MongoDB provides a configurable durability model, the reality is that the vast majority of applications need some storage guarantees, that is, a journaled write concern.

On the MongoDB server side this is implemented as a scheduled thread that periodically writes the operations queued in memory to the journal. If we write in batches, besides minimizing network round trips, we maximize the number of operations that will be saved to the journal in the next period.
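Client-side batching can be sketched without tying it to any driver. In the example below, BatchingWriter is a hypothetical class that buffers documents and hands them to a sink in groups; in a real application the sink would be whatever bulk insert call your driver or mapper offers, which is deliberately left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer that groups single writes into batches before handing them to a sink.
final class BatchingWriter<T> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void write(T document) {
        buffer.add(document);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered as one batch; does nothing if the buffer is empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink can keep the list
        buffer.clear();
    }

    @Override
    public void close() {
        flush(); // do not lose the last, partially filled batch
    }

    public static void main(String[] args) {
        // The sink here only prints; a real one would perform one bulk insert per batch.
        try (BatchingWriter<String> writer =
                 new BatchingWriter<>(4, batch -> System.out.println("sending " + batch.size()))) {
            for (int i = 0; i < 10; i++) {
                writer.write("doc-" + i);
            }
        }
    }
}
```

With a batch size of 4, ten writes reach the sink as three calls (4, 4 and 2 documents) instead of ten separate round trips.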

Your mileage may vary

I arrived at these guidelines after analyzing the storage requirements of several components in our architecture and examining MongoDB's capabilities and limitations. I designed a general data abstraction layer based on these ideas, targeting both ease of use and performance.

While I think these ideas apply to most cases, your scenario may need a different approach (or a different database). In any case, I think that even if you discarded Mongo before, the improvements in the latest release may make it worth a reassessment.
