Dec 20, 2012

mongoDB Sharding

If I had to make some safe bets on the near future, I would choose two: Hadoop and mongoDB.

There is huge demand for both technologies, and many players consider them a foundation for their future products.

Sharding was a major issue for large scale MySQL installations, and it is just as much of an issue for large mongoDB installations.

Back to Basics
mongo is pretty similar to a regular database, but it has two main advantages: 1) software engineers love it, as it can easily be used for object persistence, and 2) it supports unstructured objects (documents), so it can easily store different objects based on the same virtual class.

mongoDB terms

  1. Database: database
  2. Collections: very similar to tables.
  3. Documents: very similar to rows. Yet, a document can be as flexible as a JSON document can be. For example, it may include one-to-many fields in the document itself.
  4. mongod: a mongoDB instance or shard.
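  5. Chunk: a storage unit (64MB by default) that holds the documents of a contiguous shard key range.
  6. Config database: the directory that maps chunks to shards. The mongos query routers read it to route requests.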
Why use sharding?
  1. Support large dataset using commodity servers.
  2. Support high IO requirements using commodity disks.
What are mongoDB sharding features?

  1. Range-based Data Partitioning: a method very similar to MySQL partitioning. You choose one or more fields (the shard key) that sharding will be based on. Choose the shard key according to the business logic, like splitting by account id in a SaaS application.
  2. Automatic Data Volume Distribution: mongoDB takes care of balancing the shards by itself, according to the chosen shard key.
  3. Transparent Query Routing: when a query does not include the shard key, mongoDB by itself sends it to multiple shards and merges the results (very much like map reduce in Hadoop).
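To make range-based partitioning and query routing more concrete, here is a minimal Python sketch. It is a toy model and not mongoDB code: the chunk boundaries, the shard names and the account_id shard key are all assumptions made up for the illustration.

```python
from bisect import bisect_right

# Hypothetical chunk map: each chunk covers account_id values from its lower
# bound (inclusive) up to the next chunk's lower bound (exclusive).
CHUNK_LOWER_BOUNDS = [float("-inf"), 1000, 2000, 3000]
CHUNK_TO_SHARD = ["shard_a", "shard_b", "shard_a", "shard_c"]
ALL_SHARDS = sorted(set(CHUNK_TO_SHARD))


def shards_for_query(query):
    """Return the shards a query has to visit in this toy model."""
    if "account_id" in query:
        # Targeted query: the shard key value falls into exactly one chunk.
        index = bisect_right(CHUNK_LOWER_BOUNDS, query["account_id"]) - 1
        return [CHUNK_TO_SHARD[index]]
    # No shard key in the query: ask every shard and merge the results.
    return ALL_SHARDS


print(shards_for_query({"account_id": 1500}))   # one shard
print(shards_for_query({"status": "overdue"}))  # all shards
```

A query that carries the shard key touches one shard, while a query on any other field has to visit all of them.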
Key Recommendations for mongoDB Sharding
  1. Sufficient Cardinality: choose a shard key with enough distinct values, so the data can be split later into more shards if the database size gets too large (exceeds the chunk size).
  2. Uniform Distribution: choose a shard key whose values spread uniformly, to avoid an unbalanced design.
  3. Distribute Write Operations: if you have a billing system, prefer to shard according to account id rather than according to billing month. Otherwise, on a given day, probably only a single shard will be used.
  4. Query according to the shard key: if a query includes the shard key, it results in a single shard query. Otherwise, it generates N queries (one per shard).
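The billing example in recommendation 3 can be checked with a small simulation. This is only a sketch under assumptions: a hypothetical 4 shard cluster, 1,000 accounts that are all billed in December, and a simple range split that stands in for the real chunk ranges.

```python
from collections import Counter

SHARD_COUNT = 4          # hypothetical cluster size
MAX_ACCOUNT_ID = 999     # hypothetical highest account id


def shard_by_month(month):
    """Range split of the 12 billing months over the shards."""
    return (month - 1) * SHARD_COUNT // 12


def shard_by_account(account_id):
    """Range split of the account ids over the shards."""
    return account_id * SHARD_COUNT // (MAX_ACCOUNT_ID + 1)


# One day of invoices: every account is billed for December.
writes = [{"account_id": a, "billing_month": 12} for a in range(1000)]

by_month = Counter(shard_by_month(w["billing_month"]) for w in writes)
by_account = Counter(shard_by_account(w["account_id"]) for w in writes)

print("sharded by billing month:", dict(by_month))
print("sharded by account id:  ", dict(by_account))
```

With the billing month as the key, all of the day's writes land on one shard. With the account id as the key, they spread evenly over all four.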
Technical Aspects for mongoDB Sharding
  1. Every sharded collection must have an index whose first fields are the shard key (use shardCollection for that).
  2. Chunk size default limit is 64MB
  3. When a chunk reaches this limit, mongoDB splits it into two.
  4. If chunks are not distributed uniformly, mongoDB starts migrating chunks between the different shards.
  5. The cluster balancer takes care of this process.
  6. Balancing can cause performance issues and therefore can be restricted to off peak hours (nights and weekends for example) using balancing windows.
  7. The mapping of chunks to shards is saved in the config database.
  8. Replication should be considered as well, as a complementary method.
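The split, balance and balancing window steps above can be sketched as a toy model in Python. It is not mongoDB's actual balancer: the migration threshold, the window hours and the shard names are assumptions picked for the illustration, and only the 64MB limit comes from the list above.

```python
from datetime import time

CHUNK_LIMIT_MB = 64  # default chunk size limit


def add_data(chunks, index, size_mb):
    """Grow one chunk and split it into two halves if it passes the limit."""
    chunks[index] += size_mb
    if chunks[index] > CHUNK_LIMIT_MB:
        half = chunks[index] / 2
        chunks[index:index + 1] = [half, half]


def balance(shards, threshold=2):
    """Move one chunk at a time from the fullest shard to the emptiest one
    until their chunk counts differ by less than `threshold` (assumed value)."""
    migrations = 0
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) < threshold:
            return migrations
        shards[emptiest].append(shards[fullest].pop())
        migrations += 1


def in_balancing_window(now, start=time(23, 0), stop=time(6, 0)):
    """True if `now` is inside the window; handles windows that cross midnight."""
    if start <= stop:
        return start <= now < stop
    return now >= start or now < stop


# Chunk sizes in MB per shard (hypothetical starting state).
shards = {"shard_a": [60.0, 40.0, 30.0, 20.0], "shard_b": []}
add_data(shards["shard_a"], 0, 10)      # 70MB > 64MB, so the chunk splits
migrations = 0
if in_balancing_window(time(2, 0)):     # 02:00 is off peak in this sketch
    migrations = balance(shards)
print(migrations, {name: len(chunks) for name, chunks in shards.items()})
```

Restricting balance() to the window is what keeps the migrations away from peak hours.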
Bottom Line
mongoDB brings to the table an out of the box sharding solution that can scale your operations. Now, you only need to analyze your needs and select the right solution for them.

Keep Performing,

Dec 13, 2012

MySQL Crash Course Presentation

In the last few weeks I taught a MySQL crash course. The course topics covered almost everything needed for an initial ramp up when you get into MySQL: ERD, DDL, DML, installation, security, scaling, backup, schema design, tuning, master slave and more...

The good news
I got very good feedback from the students, so I decided to share the presentation itself with you:

Keep Performing,
Moshe Kaplan

Dec 9, 2012

How to use rsync for high availability environments?

What if...
  • What if I have a large number of web servers and I need to deploy the same code on all of them?
  • What if I would like to enable high availability and redundancy for static user content such as images?
  • What if I want to back up files to a central storage?

A Swiss Army knife for static content replication
rsync has long been considered the best solution for static content and code replication in environments that consist of a large number of servers.

rsync has a simple protocol that replicates one or more directories from a single server to other servers. This can be done in two different ways (with a command syntax that resembles SCP):

  • Push from the master to the slave: rsync [OPTION] … SRC [SRC] [USER@]HOST:DEST
  • Pull from the server by the slave: rsync [OPTION] [USER@]HOST:SRC [DEST]
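As a sketch of the two modes, the Python snippet below only builds the rsync command lines, so it can be read and run without touching any server. The host, user and paths are hypothetical, and the -a (archive) and -z (compress) options are just a common choice, not a requirement.

```python
import shlex


def push_command(src_dir, user, host, dest_dir):
    """Master pushes a local directory to a remote slave."""
    return ["rsync", "-az", src_dir, f"{user}@{host}:{dest_dir}"]


def pull_command(user, host, src_dir, dest_dir):
    """Slave pulls a directory from the remote master."""
    return ["rsync", "-az", f"{user}@{host}:{src_dir}", dest_dir]


# Hypothetical deployment: the trailing slash on the source means
# "copy the directory's contents", not the directory itself.
push = push_command("/var/www/", "deploy", "web1.example.com", "/var/www/")
pull = pull_command("deploy", "master.example.com", "/var/www/", "/var/www/")

print(shlex.join(push))
print(shlex.join(pull))
```

To actually run one of them, pass the list to subprocess.run(push, check=True) on a machine that can reach the remote host.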

Can I perform a change on the destination directory?
Please note that rsync works by analyzing the differences between two directories and making the destination match the source. Therefore, it is probably not a good fit for cases where you want to change the content of the destination directory directly.

How should I authenticate?
Use one of the two options:

  1. Static user/password, using sshpass for non-interactive SSH based authentication.
  2. PKI authentication, using on-the-fly key generation or pre-generated keys.
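Both options can be sketched as wrappers around an rsync command line. The Python snippet below only builds the commands and does not run them. The host, user and key path are hypothetical, and the key path is assumed to contain no spaces.

```python
def with_password(rsync_cmd):
    """Option 1: prefix with sshpass. The -e option tells sshpass to read the
    password from the SSHPASS environment variable instead of prompting."""
    return ["sshpass", "-e"] + rsync_cmd


def generate_key_command(key_path):
    """Option 2, step 1: create an RSA key pair with an empty passphrase."""
    return ["ssh-keygen", "-t", "rsa", "-N", "", "-f", key_path]


def with_key(rsync_cmd, key_path):
    """Option 2, step 2: make rsync run ssh with that private key."""
    return [rsync_cmd[0], "-e", f"ssh -i {key_path}"] + rsync_cmd[1:]


# Hypothetical base command and key location.
base = ["rsync", "-az", "/var/www/", "deploy@web1.example.com:/var/www/"]
key_path = "/home/deploy/.ssh/id_rsync"

print(with_password(base))
print(generate_key_command(key_path))
print(with_key(base, key_path))
```

For the key based option, the matching public key also has to be added to the remote user's authorized_keys file before the first run.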
Master-Master replication
Like in MySQL, Master-Master replication can be achieved by setting up two Master-Slave connections, one in each direction. Please consider enabling only one of these connections at a time. Then, during a failover, disable the replication. Last, when you bring the failed master back, enable the replication in the other direction.
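The failover sequence can be written down as a tiny state model. This Python sketch is only an illustration of the order of the steps: the server names "a" and "b" are hypothetical, and no rsync process is started or stopped here.

```python
class MasterPair:
    """Toy model of two servers where at most one replication direction
    is enabled at any time."""

    def __init__(self):
        self.active_master = "a"
        self.replication = ("a", "b")  # (source, destination), normal operation

    def _peer(self):
        return "b" if self.active_master == "a" else "a"

    def fail_over(self):
        """The active master went down: disable replication, promote the peer."""
        self.replication = None
        self.active_master = self._peer()

    def bring_back(self):
        """The failed server is back: replicate from the new master to it."""
        self.replication = (self.active_master, self._peer())


pair = MasterPair()
pair.fail_over()
state_during_failover = (pair.active_master, pair.replication)
pair.bring_back()
state_after_recovery = (pair.active_master, pair.replication)

print(state_during_failover)
print(state_after_recovery)
```

Keeping a single direction active at any moment is what prevents the two servers from overwriting each other's changes.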

Note: you may consider using OpenStack Storage for these purposes as well, as it provides an out of the box solution for high availability and redundancy that easily supports multi master.

Keep Performing,
Moshe Kaplan

