MongoDB Vs Cassandra

Over the two years we’ve been using MongoDB in production with our server monitoring tool, Server Density, we’ve built up significant experience and knowledge about how it works. Back in 2009, when I was looking for a replacement for MySQL, I looked at Cassandra but dismissed it because MongoDB had several advantages and Cassandra was still at an extremely early stage (even more so than MongoDB at the time). Having been invited to give a comparison at the Cassandra London Meetup, I thought I’d revisit it to see how it compares today.

Disclaimer: It’s important to note that much of what I know about MongoDB has been learnt through using it in production. We don’t use Cassandra, so any comparisons are going to be fairly superficial, but they will still be relevant because that’s the stage most people will be at when they are considering which database to pick. As a result, I will try to avoid making technical comparisons about specific features because these would be biased by my extensive understanding of MongoDB vs a limited understanding of Cassandra.

As such, this comparison is split into 2 types of difference – usage and operations.

  • Usage: The actual usage as a developer implementing the application with the database.
  • Operations: Points which are not directly about the core database but about its suitability for production and management on an operational level.

That said, I will start with several technical comparisons because these are important to understand.

Usage – Structure

MongoDB acts much like a relational database. Its data model consists of a database at the top level, then collections, which are like tables in MySQL (for example), and then documents, which are contained within a collection like rows in MySQL. Each document is made up of fields and values, similar to columns and values in MySQL. Fields can be simple key / value pairs e.g. { 'name': 'David Mytton' } but they can also contain other documents e.g. { 'name': { 'first' : 'David', 'last' : 'Mytton' } }.

In Cassandra, the equivalent unit is known as a “column”, which is really just a single key and value e.g. { 'key': 'name', 'value': 'David Mytton' }. There’s also a timestamp field, which is used internally for replication and consistency. The value can be a single value but can also contain another “column”. These columns then exist within column families, which order data based on a specific value in the columns, referenced by a key. At the top level there is a keyspace, which is similar to the MongoDB database.

A good set of data model diagrams for Cassandra can be found here.
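To make the two structures a little more concrete, here is a minimal sketch in plain Python (no database driver involved). It only mirrors the shapes described above; the column family name "users", the row key "dmytton" and the helper get_column are made up for illustration and are not part of either database's API.

```python
from dataclasses import dataclass

# MongoDB-style document: fields map to values, and a value may itself
# be another document.
mongo_document = {"name": {"first": "David", "last": "Mytton"}}


@dataclass
class Column:
    """Cassandra-style column: a single key and value plus a timestamp."""
    key: str
    value: str
    timestamp: int  # illustrative stand-in for the internal timestamp


# keyspace -> column family -> row key -> columns
# ("users" and "dmytton" are hypothetical names for this sketch)
keyspace = {
    "users": {
        "dmytton": [Column("name", "David Mytton", timestamp=1)],
    }
}


def get_column(keyspace, family, row_key, column_key):
    """Return the value of one column in a row, or None if it is absent."""
    for column in keyspace[family][row_key]:
        if column.key == column_key:
            return column.value
    return None
```

Reading a nested field in the document is a plain lookup such as mongo_document["name"]["first"], whereas the column model is always addressed by family, row key and column key.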

Usage – Indexes

MongoDB indexes work very similarly to those in relational databases. You create single or compound indexes at the collection level and every document inserted into that collection has those fields indexed. Querying by index is extremely fast so long as all your indexes fit in memory.

Prior to 0.7, Cassandra was essentially a key/value store, so if you wanted to query by the contents of a key (i.e. the value) you needed to create a separate column which referenced the other columns i.e. you created your own indexes. This changed in Cassandra 0.7, which allowed secondary indexes on column values, but only through the column families mechanism.
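As a rough idea of what “creating your own indexes” on top of a key/value store involves, here is a small sketch in plain Python. It is not Cassandra code: the KeyValueStore class and its methods are hypothetical, and the index is just a second mapping that the application has to keep in step with the data on every write.

```python
from collections import defaultdict


class KeyValueStore:
    """Toy key/value store with a hand-maintained index on 'state'."""

    def __init__(self):
        self.rows = {}                       # row key -> {column: value}
        self.state_index = defaultdict(set)  # state value -> row keys

    def put(self, row_key, columns):
        # Remove any stale index entry before overwriting the row.
        old = self.rows.get(row_key)
        if old is not None and "state" in old:
            self.state_index[old["state"]].discard(row_key)
        self.rows[row_key] = dict(columns)
        # The application, not the store, keeps the index up to date.
        if "state" in columns:
            self.state_index[columns["state"]].add(row_key)

    def find_by_state(self, state):
        """Look up row keys by value via the hand-built index."""
        return sorted(self.state_index.get(state, set()))
```

The point of the sketch is the extra bookkeeping in put: every write has to update the index as well as the row, which is the work that secondary indexes take off your hands.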

Cassandra requires a lot more metadata for indexes and requires secondary indexes if you want to do range queries. For example, if we define a new column family with 1 index:

$ bin/cassandra-cli --host localhost
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] create keyspace demo;
[default@unknown] use demo;
[default@demo] create column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS}];

then we cannot do range queries, because the query has no equality match on an indexed column:

[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ

We must create a secondary index:

update column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

Then Cassandra can use the state index for the primary lookup and filter the results based on the birth_date:

get users where state = 'UT' and birth_date > 1970;

(Code samples taken from this blog post).
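The behaviour in that example can be modelled in a few lines of plain Python. This is my own simplified sketch, not Cassandra’s implementation: the query function, its arguments and the sample rows are hypothetical. It rejects a query unless at least one clause is an equality match on an indexed column, then uses that clause for the lookup and applies the remaining clauses as a filter.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def query(rows, indexed_columns, clauses):
    """rows: {row_key: {column: value}}; clauses: [(column, op, value)]."""
    eq_indexed = [c for c in clauses
                  if c[1] == "=" and c[0] in indexed_columns]
    if not eq_indexed:
        # Mirrors the error message shown in the CLI session above.
        raise ValueError(
            "No indexed columns present in index clause with operator EQ")
    column, _, value = eq_indexed[0]
    # Step 1: equality lookup. This is a scan here; a real index avoids it.
    candidates = {k: r for k, r in rows.items() if r.get(column) == value}
    # Step 2: apply every clause as a filter over the candidates.
    return sorted(
        k for k, r in candidates.items()
        if all(c in r and OPS[op](r[c], v) for c, op, v in clauses)
    )


# Hypothetical sample data for the sketch.
rows = {
    "a": {"state": "UT", "birth_date": 1975},
    "b": {"state": "UT", "birth_date": 1960},
    "c": {"state": "CA", "birth_date": 1980},
}
clauses = [("state", "=", "UT"), ("birth_date", ">", 1970)]
```

Calling query(rows, {"birth_date"}, clauses) raises the error because state is not indexed, while query(rows, {"birth_date", "state"}, clauses) succeeds once it is.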

Usage – Deployment

MongoDB is written in C++ and provided in binary form for Linux, OS X, Windows and several other platforms. It’s extremely easy to “install” – download, extract and run mongod.

Cassandra is written in Java and has the overhead that brings, but also the easy ability to integrate into existing Java projects. It takes a little longer to get started but there is a demonstration of setting up a 4 node cluster in less than 2 minutes, which you’d struggle to beat with MongoDB.

I know plenty of people running MongoDB on Windows but would be interested to hear if that’s the same with Cassandra (I suspect it’s more Linux).

Operations/Usage – Consistency/Replication

In MongoDB, replication is achieved through replica sets. This is an enhanced master/slave model in which one node in the set is the master. Data is replicated to all nodes so that if the master fails, another member will take over. There are configuration options to determine which nodes have priority and you can set options like sync delay to have nodes lag behind (for disaster recovery, for example).

Writes in MongoDB are “unsafe” by default: data isn’t written to disk right away, so it’s possible that a write operation could return success but be lost if the server fails before the data is flushed. This is how Mongo attains high performance. If you need increased durability then you can specify a safe write, which will guarantee the data is written to disk before returning. Further, you can require that the data also be successfully written to n replication slaves.

MongoDB drivers also support the ability to read from slaves. This can be done at the connection, database, collection or even query level and the drivers handle sending the right queries to the right slaves, but there is no guarantee of consistency (unless you are using the option to write to all slaves before returning). In contrast, Cassandra queries go to every node and the most up to date column is returned (based on the timestamp value).
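The “most up to date column wins” idea can be sketched in plain Python as follows. This is only an illustration of picking the newest value by timestamp; the Column tuple and the reconcile function are hypothetical names, and the sketch says nothing about how Cassandra decides which nodes to ask.

```python
from typing import NamedTuple


class Column(NamedTuple):
    key: str
    value: str
    timestamp: int


def reconcile(replica_responses):
    """Return the column with the highest timestamp, or None if no node has it.

    replica_responses holds one entry per node that answered; None means
    that node had no copy of the column.
    """
    present = [c for c in replica_responses if c is not None]
    if not present:
        return None
    return max(present, key=lambda c: c.timestamp)


# Hypothetical responses from four nodes holding different versions.
responses = [
    Column("name", "David", 1),
    None,
    Column("name", "David Mytton", 3),
    Column("name", "Dave", 2),
]
newest = reconcile(responses)
```

Because the comparison relies entirely on timestamps, a scheme like this depends on the clocks that generate them being reasonably well synchronised.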

Cassandra has much more advanced support for replication by being aware of the network topology. The server can be set to use a specific consistency level to ensure that queries are replicated locally, or to remote data centres. This means you can let Cassandra handle redundancy across nodes where it is aware of which rack and data centre those nodes are on. Cassandra can also monitor nodes and route queries away from “slow” responding nodes.

The only disadvantage with Cassandra is that these settings are made at the node level with configuration files, whereas MongoDB allows very granular ad-hoc control down to the query level through driver options which can be called in code at run time.



