Monday, May 16, 2011

Tuning Cassandra cluster configuration and improving performance

If you are considering to migrate to NO-SQL solution and want to use some data store which is open-source, massively scalable, fault tolerant, peer to peer and column based, there is a high probability that you will choose Apache Cassandra. According to my understanding, Apache Cassandra edges over Apache HBase when the data node topology required is something which is more robust and peer to peer. But again, HBase provides a tight integration with Hadoop ecosystem, which in a way or other can also be achieved by Apache Cassandra.
Anyways, yaml is always available to list down the configurable parameters but just to have a checklist, couple of important factors which should be take care when configuring Cassandra cluster for production environment (cloud/non-cloud based):

Q: Do we need high availability?
A: Yes, then we require to have multiple nodes across multiple data centres. We require to configure below:

  • Endpoint Snitch: You may want to specify which cassandra nodes are in the same data centre/same rack, write in ip address, data centres to be there in cluster. We can specify octet based (RackInferring), explicit property file mentioning node-DC mapping (PropertySnitch), or even EC2 region based(EC2Snitch)
  • Replication Factor: the copies of the data we want in our cluster, less or equals to number of nodes. 1 means no copies (just original data).
  • Partitioner: controls how data is distributed across nodes. If you want to facilitate range slice queries or want to provide indexing using Lucene/Solr (over Cassandra) then you will require OrderPreserving. There is a probability of Hot Spot in your cluster when using OrderPreserving. RandomPartitioning takes care of this hot spot problem as it performs a MD5 hash over "data key" get a "location key" which spans over a range of 0..2^127.
  • Seeds: You would like to feed in ip address of other nodes in the cluster.
  • ReplicaPlacement: You may want to control where data replicas are placed (something to add up to HA), OldNetworkTopology/RackAware places first one in other DC and others (if required) in other racks of the same DC. NetworkTopology places replicas (if required by ReplicationFactor) on locations each DC and Rack inside it.
Q: Do you want some new nodes to be added dynamically to the existing cluster?
A: Yes, then you will require to set Autobootstrap to true. For consistent hashing, Initial token must evenly divide the keyspace. Specifying Initial_Token you can dictate the token responsibility of the new node in the cluster. You can calculate the initial token for kth node in a n node cluster by,
(k-1)*((2^127)/n)

Q: How to ensure consistency across distributed cassandra nodes?
A: Defined at data access layer (hector/thrift/avro), Consistency Level plays an important role in determining whether the overall cassandra cluster may serve old data or not (if replication has not happened yet). Consistency Level ANY, ONE for write and ONE for read gives better performance but at the cost of eventually consistent data. A higher data consistency can be achieved by QUORUM ((N/2)+1), LOCAL_QUORUM (local DC's (RF/2)+1), EACH_QUORUM (each DC's (RF/2)+1), and finally ALL (which blocks till read/writer is served by all).
Just to sum up, if you want strongly consistent data (consistency level for write+consistency level for read >Replication factor. QUORUM provides consistent data and availability of data in case of some node failure. If performance/latency is more important than you can lower the values of consistency level for either READ or WRITE or both.

Q: How to achieve performance SLAs?
A: Read performances can be enhanced by tuning couple of parameters:

  • ColumFamily enables you to have Key and/or Row caching on them. You can specify the number of key/rows, percentage or fraction over the whole data you want to store in cache. Key cache is lighter as they data stored in them is just keys. By default, Cassandra caches 200, 000 keys per CF. Key cache decreases a I/O to index file to figure out the row location of the data corresponding to that key. It's very productive to have key cache. Row Cache holds the entire content of a row in cache, by default it is off. The overhead of enabling row cache or over increasing it is that you may require more JVM heap of Cassandra, may adverse the performance. Also, if data's column size is small, it's good so that we can increase the size of row cache without worrying too much about memory consumption, but if columns are larger than increasing row cache size is tricky. The best way to fine tune it is to watch out "RowCacheHitRatio" exposed per CF on JMX over a sample data run and according to the utilization configure the optimal size. If "RowCacheHitRatio" is too low <20-30 (which will be in case writes are much higher than read or extremely random reads, anyhow will trigger high GC activities), then I will suggest it doesn't make sense to enable row caching. You can use nodetool cfstats to see the performance/hit raio of key/row cache. It takes time to get a good idea about the hit rate because hit rate gradually matures.
  • Read performances can also be enhanced by fine tuning the Concurrent Reads, usually the rule is 4 threads per CPU core in the cluster. The higher the values the number of threads spanned for read, if the machines have faster I/O than the usual commodity versions then we can even increase this number a bit. Increasing it too much will in turn cause high context switches and will not yield higher performance gains.
  • Try to use faster RPC client instead of wrapper over wrapped RPC layer.
  • Read performance also get seriously degraded by contention during SSTable compaction, "NoofLiveSSTables" exposed on JMX (I guess so) gives you a typical indication about how many SSTables are getting used for a CF, or else we can use cfstats sstable count to get that. If this number is quite high or increasing, than your read operation may have contention with sstable compaction. Also, the more SSTables, the more fragmented (internal and external both) your data is. Increasing the configurations like memtable_flush_after_mins (or memtable_operations_in_millons or memtable_throughtput_in_mb {total ram/(1048576*16)}) will decrease memtable flushes which decrease number of SSTables and eventually number of them be compacted. Hence, less conflicts with read operations, which will improve read performance. The larger the value of these variables, the more heap memory you require to bubble up this in memory data structure. Even, the thread priority of the compaction thread inside can be controlled which eventually decreases number of compactions, this is a JVM argument to cassandra.

Write performance can be also be tuned, but do we really require to increase the lightning fast write behaviour cassandra provides? Anyway, following are the some of the ways to achieve it,

  • We can use bulkloading API's (StorageProxy, mutation or Binary verb) to write to Cassandra instead of using one of the wrappers.
  • Decreasing the Write Consistency level will also help as less number of nodes required to be blocked for the write operation.
  • It's preferable to use separate disk drive to write commit log. I guess this avoids I/O contention of Commit Log writes with SSTable reads.
  • You can tune Concurrent_Write to increase it from defaulted 32. Concurrent Writers should be set a bit higher than Concurrent Reads, ranging between 1x to 1.5x (x=number of Concurrent Reads). You can set it around 12x of cores.
Swap memory may throw important challenges to achieve effective read/write performances in a java application. Operating System swaps pages to and from the disk even when sufficient memory available. During swap space read/write you may observe some difference in performance. You can disable the swap with swapoff command. On linux based OS, you can append vm.swappiness=0 or 5 to file /etc/sysctl.conf to reduce likelihood of OS to use swap space. Cassandra provides a memory_locking_policy parameter in yaml, you require to enable it.  JNA libraries helps you to lock JVM memory making it unevicatable to be swapped.
Performance of Memory Mapped I/O is better than regular I/O, on a 64 bit architecture you can use this performance efficient I/O by setting disk_access_mode  in yaml to mmap.

Q: Do I require to tune JVM for Cassandra cluster setup?
A: Yes, you may want to customize it. Make MAX and MIN to be same (just to avoid full GC during heap growth/heap shrink) It's expensive to grow the heap, just MIN to what you assume should be the maximum cassandra will consume, make MAX a bit higher than that for "just in case" cassandra is hit with more load. As in most cases, its better to do a full GC and grow the heap rather than get an OutOfMemory and crash. Heap size to be allocated to Cassandra, can be calculated to be approximately by, Memtable_throughtput_in_mb*3*(number of hot column families) + I GB + key cache + row cache.Cassandra default to many GC configurations which are perfectly configured for the big data entities. Just to for example, -XX:SurvivorRatio (defaulted to 8), try to retain as many objects in survivor spaces so that they can be reclaimed by GC, if we increase the size of survivor space then copying these long lived objects between survivors will trigger minor GC. Interesting balance, but better to copy more between survivors than to promote more to Tenured. Tenuring distribution can be seen using -XX:+PrintTenuringDistribution.

We have -XX:+UseParNewGC, parallel GC has ergonomics, like tuning young generation, it avoids/decrease frequency of major GCs, it avoids full GC by avoiding/decreasing promotions, maximizes the heap size. Also, we use CMS (ConcurrentMarkSweepGC), this is for tuning old generation, CMS is designed to avoid stop-the-world pauses. Number of concurrent threads in CMS can be controlled by -XX:+ParallelCMSThreads. Pauses dones during remark phases can be reduced by enabling -XX:+CMSParallelRemarkEnabled. Just to put a point if you want to use, by default classes will not be unloaded from the permanent generation when using CMS, this can be enabled by using -XX:+CMSClassUnloadingEnabled.

On 64 bit machines, on Java SE 1.6 update 14, you can utilize the compressed pointers which results in smaller heap sizes, you have to append -XX:+UseCompressedOops to enable this.

Q: Any hardware changes to boost the performance?
A: Yes sure.
  • You can utilize 10 Gigabit ethernet to increase network throughput.
  • You can increase RAM, RAM can be allocated to Cassandra JVM heap, eventually passes advantage to lot many factors.
  • Multi Channel Memory Architecture (MCMA) technology increases the transfer rate between RAM and Memory Controllers by adding channels, theoretically transfer rate gets multiplied by the number of channels. Modern high end chip sets like, i7-9x series or latest of Xeon chip sets supports triple channel memory architecture. Even quad or eight channel memory chip sets are also present. You can optimise the RAM according to the multi memory channel, you can pair DIMMs as the number of memory channels, for example, if using triple channel memory chip sets, let each server to have triplets of DIMMs. You have to cross check the MCMA support of your server's Processor as well as Motherboard, once you know this, you can pair up your DIMMs according to that supported number to extract the best performance from the RAM.
  • On master or server you typically add more disk as RAID, whereas just to avoid administrative overhead, pipelined writes, you must add more disk to slaves as JBOD (typically 0.8*core number of disk to get optimal performance, but must not be greater than 1*core number of disk).
  • You can use fast seeking disks, Solid State Drives (SSD), SCSI systems are also better in performance. Typically, SATA drivers are recommended over SAS drives.
  • If you are setting filesystem as ext3 (you can disable the noatime, so that administrative overhead can be reduced) Changes require to done in /etc/fstab and do a mount -o remount /. If you have option to use other file system, you can use faster filesystems like ext4 and XFS, which also supports bigger file sizes and bigger volumes.
  • You can club multiple network cards to achieve high performance.