Oleksii Mandrychenko: Senior Software Engineer at ZoneFox
TL;DR: ZoneFox is next-generation data protection software that helps its users protect their business-critical assets: data and Intellectual Property. It has a proven track record of protecting its customers’ reputation, sales revenue, and competitive advantage through real-time alerting and offline forensic search capabilities.
ZoneFox consists of a lightweight agent that streams data from the endpoint under protection to a centralised set of data analysis and storage components. ZoneFox initially tried standard relational SQL technologies, MS SQL in particular, but the license price, the difficulty of setting it up, and its inability to cope with large volumes of data were serious limiting factors. They dropped it in favour of NoSQL technologies, evaluating HBase, RavenDB and Cassandra.
“Cassandra clearly won the battle,” forming the current backbone of ZoneFox’s database/persistence components. At the moment each of their customers gets a cluster of between 2 and 10 Cassandra nodes. Each node has a 600 GB SSD for memory tables and 4 TB of commodity hard drives for data storage.
What does ZoneFox do and what is your role there?
ZoneFox is next-generation data protection software that helps its users protect their business-critical assets: data and Intellectual Property.
Launched in 2010, ZoneFox has spent 10 person-years developing the product to a point where it has a proven track record of protecting our customers’ reputation, sales revenue, and competitive advantage through real-time alerting and offline forensic search capabilities.
I work as a senior software engineer, making sure our customers get value from the product. I spend a lot of time designing and prototyping parts of the system; some of this work involves testing new technologies in order to solve our problems.
How are you using Apache Cassandra?
ZoneFox consists of a lightweight agent that streams data from the endpoint under protection to a centralised set of data analysis and storage components. We collect two major forms of data: events, which describe any user interaction on the endpoint, and alerts, which are descriptions of malicious behaviour that match events coming from the endpoint. Cassandra forms the backbone of our database/persistence components.
For a small organisation we get around 20 million events a day, which we store in a Cassandra cluster. We also store alerts, but the number of alerts is relatively small. Alerts are stored next to the events purely for co-location and ease of analysis.
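Purely as an illustration (the keyspace, table and column names below are invented, not ZoneFox’s actual schema), a time-bucketed event table along the following lines is one common way to model this kind of per-endpoint stream in Cassandra, with alerts kept in a sibling table so they can be analysed alongside the events they match. The sketch assumes the DataStax C# driver and a single local node:

```csharp
// Hypothetical sketch only -- not ZoneFox's actual schema.
// Assumes the DataStax C# driver (NuGet: CassandraCSharpDriver) and a local node.
using System;
using Cassandra;

class EventStoreSketch
{
    static void Main()
    {
        var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
        var session = cluster.Connect();

        // Replication factor 1 is only sensible for a single dev node.
        session.Execute("CREATE KEYSPACE zonefox WITH replication = " +
                        "{'class': 'SimpleStrategy', 'replication_factor': 1}");

        // Events are partitioned by endpoint and day so partitions stay bounded;
        // within a partition, rows are ordered by time.
        session.Execute(@"CREATE TABLE zonefox.events (
                            endpoint_id text, day text, event_time timestamp,
                            user text, action text, detail text,
                            PRIMARY KEY ((endpoint_id, day), event_time))");

        // Alerts sit in a sibling table, co-located with the events they match.
        session.Execute(@"CREATE TABLE zonefox.alerts (
                            endpoint_id text, day text, alert_time timestamp,
                            rule text, severity int,
                            PRIMARY KEY ((endpoint_id, day), alert_time))");

        // A typical write from the agent's stream.
        var insert = session.Prepare(
            "INSERT INTO zonefox.events (endpoint_id, day, event_time, user, action, detail) " +
            "VALUES (?, ?, ?, ?, ?, ?)");
        session.Execute(insert.Bind("laptop-042", "2014-06-01", DateTimeOffset.UtcNow,
                                    "alice", "file_copy", "payroll.xlsx -> usb"));

        cluster.Dispose();
    }
}
```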
At the moment each of our customers gets a cluster of between 2 and 10 Cassandra nodes. Each node has a 600 GB SSD for memory tables and 4 TB of commodity hard drives for data storage. Machines typically have 4 cores per node and 16 GB of memory. This is all built on Ubuntu.
We don’t typically use the latest version of Cassandra; at the moment we are on version 1.2.
With this set-up we found that Cassandra consumes only 25% or less of the resources, so we deployed Apache Hadoop MapReduce jobs on the same nodes. These jobs run every night.
What was the motivation for using Cassandra and what other technologies was it evaluated against?
We initially tried standard relational SQL technologies, MS SQL in particular. But the license price, the difficulty of setting it up as a cluster, the relational model, and the inability to cope with large volumes of data were serious limiting factors. So we dropped it in favour of NoSQL technologies, which were starting to emerge as viable alternatives.
We evaluated HBase, RavenDB and Cassandra. HBase was difficult to set up and configure, and it didn’t have well-implemented drivers for C#, which is our main development language. We also tried RavenDB, but it just didn’t scale well and was slow servicing queries. Perhaps we didn’t spend enough time on those two technologies and didn’t tune all the configuration settings they had to offer, but Cassandra clearly won the battle. It was simple to set up and use, and there were several relatively mature drivers for C#.
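To give a rough sense of that ease of use (the keyspace and table names here are placeholders, not anything from the product), connecting and querying with the DataStax C# driver takes only a handful of lines:

```csharp
// Minimal connection sketch with the DataStax C# driver
// (NuGet: CassandraCSharpDriver); keyspace and table names are placeholders.
using System;
using Cassandra;

class QuickStart
{
    static void Main()
    {
        // Point the driver at one node; it discovers the rest of the cluster.
        var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
        using (var session = cluster.Connect("zonefox"))   // placeholder keyspace
        {
            var rows = session.Execute("SELECT endpoint_id, action FROM events LIMIT 10");
            foreach (var row in rows)
                Console.WriteLine("{0}: {1}",
                                  row.GetValue<string>("endpoint_id"),
                                  row.GetValue<string>("action"));
        }
        cluster.Dispose();
    }
}
```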
Can you share some insight on what your deployment looks like?
We usually keep data local to the client, due to the sensitivity of the data; not many clients are prepared to keep all their secrets in the cloud. For some of the testing we use Amazon EC2, and we also use internal virtual machines running Cassandra nodes.
The data usually sits in a single DC. For testing we normally run 3 nodes, with different replication strategies for events and alerts. To date, our largest cluster held about 8 TB of data. And the way we build our queries allows us to search all of the data in constant time, which is really neat!
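For context, Cassandra configures replication per keyspace, so one way to give events and alerts different replication strategies is to keep them in separate keyspaces. The names and replication factors below are purely illustrative for a small test cluster, not ZoneFox’s production settings:

```csharp
// Illustrative only: per-keyspace replication for a small test cluster.
// Keyspace names and replication factors are invented, not ZoneFox's settings.
using Cassandra;

class ReplicationSketch
{
    static void Main()
    {
        var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
        var session = cluster.Connect();

        // High-volume event data: tolerate losing one replica.
        session.Execute("CREATE KEYSPACE zf_events WITH replication = " +
                        "{'class': 'SimpleStrategy', 'replication_factor': 2}");

        // Alerts are few but important, so keeping more copies is cheap.
        session.Execute("CREATE KEYSPACE zf_alerts WITH replication = " +
                        "{'class': 'SimpleStrategy', 'replication_factor': 3}");

        cluster.Dispose();
    }
}
```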
What advice do you have for those just getting started with Cassandra?
Get your head around column-oriented databases. I can recommend the book by Pramod Sadalage and Martin Fowler called “NoSQL Distilled”; it gives a really concise overview of the technologies. Make sure to read some blog posts around Cassandra. Don’t be afraid to give it a try; it’s really easy to use.
Be prepared to forget everything about data normalisation. De-normalisation is your friend.
Think about your queries beforehand. When the data is de-normalised, you can no longer easily change a query. But once you know what you need, the query time is usually constant, even if you store PBs of data.
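As an illustrative sketch of that query-first, de-normalised approach (the table and column names are invented, not ZoneFox’s model): if events must be searchable both by user and by endpoint, each event is simply written into two tables, one per query, so every read stays a single-partition lookup however much data accumulates.

```csharp
// Illustrative "one table per query" sketch, not ZoneFox's actual model.
// Assumes a keyspace named zonefox already exists (DataStax C# driver).
using System;
using Cassandra;

class QueryFirstSketch
{
    static void Main()
    {
        var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
        var session = cluster.Connect("zonefox");

        // One table per query the product needs to answer:
        //   "what did this user do today?"     -> events_by_user
        //   "what happened on this endpoint?"  -> events_by_endpoint
        session.Execute(@"CREATE TABLE events_by_user (
                            user text, day text, event_time timestamp,
                            endpoint_id text, action text,
                            PRIMARY KEY ((user, day), event_time))");
        session.Execute(@"CREATE TABLE events_by_endpoint (
                            endpoint_id text, day text, event_time timestamp,
                            user text, action text,
                            PRIMARY KEY ((endpoint_id, day), event_time))");

        // De-normalisation: every event is written to both tables.
        var now = DateTimeOffset.UtcNow;
        session.Execute(session.Prepare(
            "INSERT INTO events_by_user (user, day, event_time, endpoint_id, action) " +
            "VALUES (?, ?, ?, ?, ?)").Bind("alice", "2014-06-01", now, "laptop-042", "file_copy"));
        session.Execute(session.Prepare(
            "INSERT INTO events_by_endpoint (endpoint_id, day, event_time, user, action) " +
            "VALUES (?, ?, ?, ?, ?)").Bind("laptop-042", "2014-06-01", now, "alice", "file_copy"));

        // The query you planned for hits exactly one partition.
        foreach (var row in session.Execute(
            "SELECT action FROM events_by_user WHERE user = 'alice' AND day = '2014-06-01'"))
            Console.WriteLine(row.GetValue<string>("action"));

        cluster.Dispose();
    }
}
```

The trade-off is extra writes and duplicated storage, which Cassandra handles cheaply, in exchange for reads that never have to scan or join.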
What’s your experience with the Apache Cassandra community?
I haven’t had much contact with it. The documentation is certainly much better than that of some of the competing products, and the API of the drivers doesn’t change much, so I guess it’s all good work!
Anything else that you’d like to add?
Be objective. Something that works for us may not work for every other project. Spend a bit of time on several technologies and pick the one that works for you. That said, our experience with Cassandra has been absolutely positive.