Clearpool Group, a trading software and execution innovator for an evolving equity market microstructure and competitive landscape, encompasses Clearpool Execution Services, an independent agency broker-dealer, and Clearpool Technologies, developer of state-of-the-art electronic trading, routing, compliance and risk solutions.
When I joined Clearpool, we started building our software from scratch. We knew we wanted to create a real-time trading system with redundancy built in. Having done this in the past, developing the trading system was the easy part. The challenge came when we needed to implement the same principles with our database. In my experience, the database is always an issue in terms of redundancy. At a couple of firms I'd worked at, the same scheme was implemented: have a primary and a secondary and write to both at all times. However, when it came time to leverage the backup, failover was never seamless. My theory on failover is that developers always assume the failover happened in one particular way, which is never the case. Failures can happen in infinite ways, and that was our experience with databases. We wrote a great deal of code to deal with failover, but when it was actually needed, it didn't really work. We knew Cassandra had failover built in, so we gave it a shot.
Another consideration was write speed and the ability to scale. Writing to an Oracle database entails expensive hardware and licenses – why incur the cost?
So why Cassandra and not any other distributed database technology? I didn’t like HBase because the NameNode creates a single point of failure. If that goes down, you’re down. We didn’t want a single point of failure anywhere in our system. With Cassandra this problem is solved.
Our original cluster was for our trading system – all the orders, trades, positions, configurations and any real time update that we published. We recently launched a second cluster to store our market data. Storing all the data published on the exchanges enables us to perform historical back testing.
At this time we're in one data center, and we'll move to multiple data centers at a later point. Our trading system cluster is running 1.2 and our market data cluster is running 2.0.9. The trading system cluster has 10 TB over 12 nodes with 4 TB drives each; the market data cluster has 5 TB over 4 nodes, also with 4 TB drives each. Every node, regardless of cluster, is a bare-metal machine with dual CPUs and 6 cores per CPU. We store a large amount of data per node. Upon reflection, we would recommend storing less per node, as it causes repair jobs to take a while.
We have two disks mirroring each other so we don't lose any information. While Cassandra has impressive resiliency built in, we wanted to add another safety layer. We haven't used SSDs yet, though it's something I'm interested in. They weren't necessary because we have a disk controller with 500 MB of cache in front of our disks: any time data is read, it hits the controller's memory before going to the physical disk.
At peak load, our write latency is around 300 microseconds per operation, while our read latency is around 200 microseconds per operation. What we have learned is that Cassandra performs well if you just append – insert without mutating any previous inserts. If you're constantly appending, it's going to work beautifully – especially for time-series data. To give added perspective, our writes on the market data cluster are on the order of hundreds of millions during the core hours of the trading day (6.5 hours) across 4 nodes, whereas our trading system cluster, which has mutations, hovers around 50 million across 12 nodes.
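The append-only, time-series pattern described above can be sketched in plain Java. This is a minimal in-memory model, not Clearpool's actual schema: it assumes partitioning by symbol and trading date so each day's ticks land in one wide row, with columns clustered by timestamp so every write is a pure append and no earlier data is ever mutated.

```java
import java.util.*;

// Illustrative sketch of an append-only time-series layout (names are made up).
// One wide row per symbol per trading day; columns keyed by timestamp.
public class TickStore {
    // rowKey "SYMBOL:YYYYMMDD" -> (timestampNanos -> payload)
    private final Map<String, NavigableMap<Long, String>> rows = new HashMap<>();

    static String rowKey(String symbol, String yyyymmdd) {
        return symbol + ":" + yyyymmdd;
    }

    // Append-only write: each tick becomes a new column keyed by its timestamp,
    // so nothing previously written is ever updated in place.
    public void append(String symbol, String yyyymmdd, long tsNanos, String payload) {
        rows.computeIfAbsent(rowKey(symbol, yyyymmdd), k -> new TreeMap<>())
            .put(tsNanos, payload);
    }

    // Range read within one partition: the typical back-testing query shape.
    public SortedMap<Long, String> range(String symbol, String yyyymmdd, long from, long to) {
        NavigableMap<Long, String> row = rows.get(rowKey(symbol, yyyymmdd));
        return row == null ? new TreeMap<>() : row.subMap(from, true, to, true);
    }
}
```

Because writes only ever add new columns, Cassandra has nothing to merge at compaction time, which is why this shape performs so well for market data.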
We have built a robust piece of software that accesses Cassandra to help us with system failures, indexing, and inserting while minimizing mutations. In general, we have written a wrapper around Astyanax's Thrift library. At first it was a challenge because we weren't familiar with Cassandra, how it worked internally, and how to set up a proper data model. However, over time, we got it right and it's one of the most powerful pieces of software that we have.
If Cassandra is not available, our software writes the data to a file and persists it back into the cluster once Cassandra comes back online. At any point in the day, if there is a large-scale cluster issue, we're not impacted – our system keeps functioning in real time. In fact, the only way we know that the cluster is unavailable is from log alerts raised by our adapter.
Our adapter is aware of various ways to index a piece of data. If we have a trade and we want to index by A) symbol, B) symbol + trade date, or C) symbol + trade date + venue, all we have to do is define the different indexed fields and our adapter will, from a single insert call, write to all three tables. On the flip side, when reading that data we have a search command based on the criteria provided. You can give as many or as few fields of the data as are known, and the adapter will know which of the index tables to use to return the results in the quickest manner. Lastly, we have also built an index cleanup where we can define an expiration for an index table. Perhaps we want table A from above forever, but tables B and C for only 45 days.
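The fan-out indexing idea can be illustrated with a small sketch. All class and method names here are made up for the example, not Clearpool's actual API: each index is declared as a list of field names, one insert call writes the record under every index's key, and a search uses whichever index matches exactly the fields supplied.

```java
import java.util.*;

// Hypothetical sketch of multi-index fan-out: one logical insert writes the
// record into every declared index table; search picks the matching index.
public class MultiIndex {
    private final List<List<String>> indexes; // e.g. [symbol], [symbol, tradeDate], ...
    private final Map<List<String>, Map<String, List<Map<String, String>>>> tables = new HashMap<>();

    public MultiIndex(List<List<String>> indexes) {
        this.indexes = indexes;
        for (List<String> ix : indexes) tables.put(ix, new HashMap<>());
    }

    private static String key(List<String> fields, Map<String, String> record) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) sb.append(record.get(f)).append('|');
        return sb.toString();
    }

    // A single insert call fans out to every index table.
    public void insert(Map<String, String> record) {
        for (List<String> ix : indexes)
            tables.get(ix).computeIfAbsent(key(ix, record), k -> new ArrayList<>()).add(record);
    }

    // Search with whichever fields are known; uses the index matching them exactly.
    public List<Map<String, String>> search(Map<String, String> criteria) {
        for (List<String> ix : indexes)
            if (new HashSet<>(ix).equals(criteria.keySet()))
                return tables.get(ix).getOrDefault(key(ix, criteria), Collections.emptyList());
        return Collections.emptyList();
    }
}
```

Per-table expiration (keep A forever, B and C for 45 days) would map naturally onto Cassandra TTLs applied per index table.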
With respect to mutations: let's say I create an order and it requires updating. I do 2 inserts in 1 ms and update again in 2 ms. On the Cassandra side this isn't optimal; it will eventually need to compact all three into one. When Cassandra merges these inserts, we are effectively adding compaction pressure, which eats up CPU cycles and disk I/O. Our software is optimized to know that it should queue inserts in memory before inserting them into Cassandra. The queue is intelligent enough to know when it's filling up rapidly, so it doesn't consume too much memory, and it knows when data in the queue has aged long enough that writes still remain near real time. Once that was done, Cassandra had significantly less work to do. Since we implemented this feature, we haven't had any issues with Cassandra.
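The coalescing idea can be sketched as follows. This is a minimal illustration with assumed names and thresholds, not Clearpool's actual queue: successive updates to the same order overwrite each other in memory, so only the latest version reaches Cassandra and compaction has far less merging to do.

```java
import java.util.*;

// Hypothetical sketch of an in-memory coalescing queue keyed by order id.
public class CoalescingQueue {
    private final LinkedHashMap<String, String> pending = new LinkedHashMap<>(); // orderId -> latest state
    private final int maxPending; // flush early if the queue is filling up rapidly
    private final List<String> flushed = new ArrayList<>(); // stand-in for the Cassandra write path

    public CoalescingQueue(int maxPending) { this.maxPending = maxPending; }

    public void update(String orderId, String state) {
        pending.put(orderId, state); // coalesce: newer state replaces older in place
        if (pending.size() >= maxPending) flush(); // bound memory use
    }

    // In production this would also run on a timer, so queued data that has
    // aged long enough is flushed and writes stay near real time.
    public void flush() {
        for (Map.Entry<String, String> e : pending.entrySet())
            flushed.add(e.getKey() + "=" + e.getValue());
        pending.clear();
    }

    public List<String> flushedWrites() { return flushed; }
    public int pendingCount() { return pending.size(); }
}
```

Three rapid updates to one order collapse into a single write, which is exactly the reduction in compaction pressure described above.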
It's all about the data model. Most people come from a traditional relational database background and will immediately want to model their data the exact same way. This is perfectly natural, but it's absolutely the wrong approach. They need to stop thinking about reducing data duplication and do the exact opposite.
With Cassandra, they don't realize that they can write as much as they want on a single row, and it can actually be better for when they need to read the data. In a traditional database, it would be unheard of to store the market data book as it looked when receiving an order. You would store the market data separately, store the order separately, and when you want to read them together you would do a join. We store all the data in real time along with any metadata that we would need when we read it, no matter how redundant it is. It's more about modeling the data based on how you would want to read it, as opposed to trying to model it for the most efficient and minimal storage use case. Writing is fast and disk is cheap. Given this, we now have the luxury of taking the overhead while writing so that reading can be simplified.
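As a small illustration of that query-first, denormalized shape (field names are made up for the example): rather than storing the order and the market-data book separately and joining on read, a snapshot of the book is duplicated into the order record at write time, so a read needs no second lookup.

```java
import java.util.*;

// Hedged sketch of a denormalized order record: the book snapshot is copied
// in at write time instead of being joined in at read time.
public class DenormalizedOrder {
    public final String orderId;
    public final String symbol;
    public final long qty;
    // Redundant copy of the book as it looked when the order arrived --
    // duplicated on purpose, because disk is cheap and reads stay simple.
    public final Map<String, Double> bookSnapshot;

    public DenormalizedOrder(String orderId, String symbol, long qty, Map<String, Double> book) {
        this.orderId = orderId;
        this.symbol = symbol;
        this.qty = qty;
        this.bookSnapshot = new HashMap<>(book); // snapshot at write time, never joined later
    }
}
```

The copy in the constructor is the whole point: later market moves don't alter what was stored, and reading the order gives you the book as it was, with no join.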
In terms of tweaking Cassandra knobs: if the data model is not correct and optimal, some users get lost tweaking the wrong things when all they really need to do is change the data model. I recommend not even looking at the knobs in the Cassandra YAML file until the data model is locked down. It's tough to know when the data model is good enough; the process is iterative.
My team and I are all Java developers using Cassandra for the first time. We needed help along the way to get it right. Searching online works, but there's so much information out there written by people with various use cases that it wasn't the most straightforward way. We went to DataStax conferences, went to meetups in NYC, communicated on the mailing list, and even engaged with Cassandra consultants – these approaches were more effective. The best advice: check out the code. It doesn't lie. The answer is always there.