Data Locus: November 2014

In this post I want to explore the topic of distributed storage by comparing two different products: Apache Cassandra, an open source operational database system and Hp Vertica a proprietary analytic database system.

A distributed database is a computer network where information is distributed and stored in several nodes.

Cassandra is a column-oriented distributed storage system with no single point of failure capable of scaling out to hundred an even thousand of nodes, highly available and resilient to network splits. It is designed to support a high ratio of random writes without sacrificing the read efficiency providing eventual consistency.

I've had previous experience working with Cassandra but recently I became involved in a project where we needed a distributed storage solution to store time series and perform analytic queries on the data. While comparing different products we came across Hp Vertica.

Vertica is the commercial version of the C-Store columnar storage system, an academic collaboration between different universities.

What sets Vertica apart from the new batch of NoSql distributed storage solutions, is that it is actually an RDBMS, ACID and fully SQL compliant and capable of scale out to large clusters. There are deployed production clusters in the range of hundreds of nodes and over the petabyte size. Vertica is an analytic database, and as such is designed to ingest great amounts of data in batch processes and support a relatively low number of very complex read transactions.

The two products are share-nothing, elastic, scale-out architectures.

The Consistent Hash Ring

When we think about a distributed storage solution, our first step is to define the sharding strategy we are going to use to split the data among nodes of the cluster.

The simpler solution could be to arbitrarily map ranges of keys to nodes. This is actually a working approach used in some systems but it has disadvantages, namely the need to store and maintain metadata tables (one per objetct/table/collection key type) with the mapping of ranges to nodes. It would probably need a central point of coordination.

We can also simply apply the function -key value- modulo -number of nodes- and the result would be the node containing the key. Since we need an integer value and also to avoid hotspots caused by the skew of our key distribution first we will need to apply a hash function with a good mixing behavior to the key: hash(key value) modulo -number of nodes-. A good candidate can be MD5 hash. Now we have a good performing algorithm that distributes the data evenly in our cluster.

What happens if we add a new requirement?. We need our cluster to be elastic, nodes can be added or removed as needed and the data must be re-balanced in such cases. With our current strategy when we add or remove nodes, we will need to re-hash an move a huge amount of data. We need a strategy decoupled from the number of nodes.

We already introduced the hash function to make or solution independent of the key types and skew. Now we introduce the concept of consistent hashing: we to apply the same hash function we use for the key value to the node id (the id can be an unique property like the ip address) . We map each node as a value in the hash range, since this range, although huge, is limited we can handle it as if it wraps up in a circular way forming a ring: when we reach the maximum value the next one is 0. We have our nodes represented as points inside the ring. To find the node corresponding to a key, we apply the hash function to the key so we get also the point representation of the key in the ring, then we move clockwise until we find the following node point and that's the one storing the data corresponding to this key. The consistency means that when we add or remove a node we only need to relocate the data stored on it.

This solution is considered the canonical consistent hash ring strategy. The node is randomly assigned a point in the ring by applying the hash function to the node id, the risk of collision with another node is negligible, the trade off is that we don't know what will be the range assigned to a node, and there could be great size differences, creating hot spots. This is particularly patent with a small number of nodes. Ranges tend to even up with a large number of nodes due to the good mixing behavior of the hash function.

The problem of load balancing in the cluster can be resolved using the virtual nodes strategy. We split the hash range or continuum in a big fixed number of slices of the same size called virtual nodes. The keys are hashed and assigned to a virtual node by applying the modulo function. The same number vnodes are assigned to the physical nodes thus avoiding hot spots. We can even assign more vnodes to the most powerful server if out cluster is not homogeneous. The mapping between nodes an vnodes is maintained in a metadata table. This is an hybrid of the two first proposals. The distribution depends on the number of vnodes, but it is a fixed value during all the life of the cluster. The mapping metadata table to maintain is unique.

Cassandra Partitioner

Cassandra initially used a partitioner to decide the node the data is stored on. (In fact the system even allows to partition the data arbirtrarily following the key ordering sequence although is strongly not recommended.

From the documentation:

"A partitioner determines how data is distributed across the nodes in the cluster (including replicas). Basically, a partitioner is a hash function for computing the token (it's hash) of a row key. Each row of data is uniquely identified by a row key and distributed across the cluster by the value of the token"

At first Cassandra followed the consistent hash ring with pseudo random generated token ids strategy. As we saw previously this solution leads to a load balancing problem in the cluster. Beginning with version 1.2 Cassandra introduces vnodes:

"Prior to version 1.2, you had to calculate and assign a single token to each node in a cluster. Each token determined the node's position in the ring and its portion of data according to its hash value. Starting in version 1.2, Cassandra allows many tokens per node. The new paradigm is called virtual nodes (vnodes). Vnodes allow each node to own a large number of small partition ranges distributed throughout the cluster. Vnodes also use consistent hashing to distribute data but using them doesn't require token generation and assignmen"

In the latest releases the virtual node strategy was adopted.

Vertica Segmentation

Vertica reserve the use of the partitioning concept to local intra-node tuple segregation to distinguish from inter-node segregation.

From the original Vertica architecture paper:

"Vertica applies a default hash function to the columns chosen as segmentation keys, This function distributes the data using a normal statistical distribution. The node on which the tuple is stored is determined by this hash. The whole range of hash values is divided between the number of nodes, and each node is assigned a range beginning with the previous node maximum and covering the maxInt/numNodes (max Integer value is 2^64) following values."

This is the second approach we described, this is dependent on the number of nodes, and any node joining or leaving the cluster force us to relocate almost all the data.

Lately Vertica introduced a new feature called elastic cluster:

"To help make data re balancing due to cluster scaling more efficient, HP Vertica locally segments data storage on each node so it can be easily moved to other nodes in the cluster. When a new node is added to the cluster, existing nodes in the cluster give up some of their data segments to populate the new node and exchange segments to keep the number of nodes that any one node depends upon to a minimum"

"The alternative to elastic cluster is to re-segment all of the data in the projection and redistribute it to all of the nodes in the database evenly any time a node is added or removed. This method requires more processing and more disk space, since it requires all of the data in all projections to essentially be dumped and reloaded."

These local segments correspond to the virtual nodes strategy.

Conclusion

It seems that we have a winner:The virtual nodes solution. Vertica and Cassandra arrived to the same strategy to distribute the data in the cluster. In fact, this solution is widely used in similar systems like Voldemort or DynamoDb.

Revisiting the CAP theorem

You are probably familiar with the CAP theorem, in brief it states that a distributed system cannot simultaneously guarantee the following properties instead it must pick two of the three and neglect the third:

Consistency. All nodes see the same data at the same time.
Availability. Every request receive a response either success or fail.
Partition. The system is resilient to network partitions and continues operating in such cases (split brain situation).

According to this, systems can be classified as CA, CP or AP.

The first time I heard of it I remember thinking that A was not possible in a real distributed systems (see the fallacies of distributed computing) and that A and P seemed to somewhat overlap.

Later I came across this interesting post by Coda Hale that re-explains the theorem in a more sensible way:

Distributed Systems are defined by the property they choose to guarantee when there is a network partition: Consistency or Availabilty. Systems can be:

CP: The system chooses consistency over availability on a network partition, I the event of a network split the system stops working or return error to all requests.

AP: The system chooses availabilty over consistency on a network partition, that means that nodes can still being giving service indepently. Once the connection is recovered, a synchronizaton mechanism restores a consistent view. Take in account an AP system is giving up strong consistency in favor of soft or eventual consistency for all its operational life to avoid a service outage in the split network exceptional situation.

The corollary is that a distributed system cannot be CA and guarante both consistency and availability in the event of a partition. Only a not-distributed system (a unique node) can be CA.

Coda Hale reasons also that availabilty is preferable over consistency in most systems since a service outage has always an economic cost and strong consistency is rarely required.

On the other hand, Michael Stonebraker one of the creators of Vertica, favors consistency over availability, his argument is also solid: the split brain situations are very rare, and in favoring availabilty over consistency in such cases you are sacrificing consistency also in nomal operational situations that are the vast majority of cases.

Cassandra is an AP system designed to be highly available, and that is a priority in an operational database system. Although it supports a tunable consistency model trading off latency for consistency, it is intended to work as an eventual consistency system and does not support row locking in any case.

It is no surprise that Vertica is a CP system, it provides an ACID consistency model. and since it is an analytical db where availability is not the priority, it seems like a right choice.

Storage Considerations

A great deal of the differences of behavior and performance come from the different local storage policies chosen, I'll leave that for a future post.

Data Locus

Saturday, November 15, 2014

Distributed Storage Concepts in Vertica and Cassandra

Vertica Segmentation

Conclusion

Revisiting the CAP theorem