Drew Johnson Head of Engineering at Aeris
"We’re really moving from a time where there are millions of devices connected to a network today to a time where there will be billions of devices connected to this network... What we were looking for in a database is a horizontally scalable data store that’s going to work and be cost effective for that change that is soon to come."

Aeris is in the machine-to-machine/Internet of Things space: what we do is connect machines into a network. These machines can be as large as cars and trucks or as small as the little sensors sitting in a gasoline tank or a parking space. We have customers, such as Hyundai and Honda, that have all of their vehicles connected to a network, and all of these networked vehicles are connected via our systems.

With cars, we think about three main categories of applications. The first is ‘under the hood applications’, which are really about the automaker managing and monitoring the vehicle (monitoring all of the engine systems, all of the transmission systems, the tire systems). The second is ‘front seat applications’; these are the applications that are most useful for the owner of the vehicle (the owner can lock and unlock the car, adjust the temperature and even remotely start the car). The third is what we call ‘backseat applications’; these are streaming applications like Pandora, TuneIn Radio, things like that.

From millions to billions

We’re really moving from a time where there are millions of devices connected to a network today to a time where there will be billions of devices connected to this network. We’re seeing a change of orders of magnitude in the scale of the kind of offering that we’re providing. What we were looking for in a database is a horizontally scalable data store that’s going to work and be cost effective for that change that is soon to come.

For our current set of services we’re generally coming from an Oracle basis, as many companies are. This company actually started ten years ago in machine-to-machine and the Internet of Things, so it has really been a pioneer in this space. Subu and I have both been at the company less than two years, and we’re part of a group in engineering that has been driving towards newer technologies, including Cassandra.

From relational to NoSQL

We really looked at all of the possibilities. The easiest path, the path of least resistance, was to stick with Oracle, but we also looked at other relational databases like MySQL and PostgreSQL.

We looked at NoSQL stores like MongoDB and even HBase. A few things that we like about Cassandra are the horizontal scalability, the linear scalability, and the ability to handle partitions so easily. There are also some other specific aspects that we really like: we enjoy the no-master, no-single-point-of-failure architecture (compared to some of the other NoSQL solutions).
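The horizontal scalability and no-master design mentioned above rest on consistent hashing: every node can work out which node owns a key, and adding a node takes over only a share of the existing keys. The following is a minimal sketch of that idea, not Cassandra's actual partitioner; the node names, the `Ring` class, the virtual-node count, and the use of MD5 are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy masterless ring: a key belongs to the first node token clockwise from it."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes      # hypothetical number of tokens per node
        self._tokens = []         # sorted ring positions
        self._owners = {}         # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owners[t] = node

    def owner(self, key: str) -> str:
        # Find the first token after the key's position, wrapping around the ring.
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


keys = [f"device-{n}" for n in range(10_000)]   # hypothetical device IDs
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.1%} of keys moved after adding a fourth node")
```

In this sketch every key that changes owner moves to the new node, and the rest stay where they were, which is why capacity can be added a node at a time without reshuffling the whole data set.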

Cassandra actually forces a separation of concerns in the architecture, and that is one of the things we want: there was a lot of tightly coupled business logic in our existing Oracle systems, and we’re trying to get ourselves out of that. Oracle is very good at storing data, but unfortunately it also encourages putting business logic tightly coupled with that data.

Like many people, we are leveraging Amazon. Our cloud strategy is to own the base, rent the peak and rent the new. Anything new that we’re developing, we deploy first in Amazon; then, as we understand the usage profile, we have an opportunity to migrate it into our data center. Cassandra works well with that.
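As a rough illustration of "own the base, rent the peak", the sketch below splits a demand curve between a fixed amount of owned capacity and rented capacity for whatever exceeds it. The demand figures, the capacity value, and the `split_load` helper are all hypothetical.

```python
def split_load(hourly_demand, owned_capacity):
    """Split each hour's demand into (served in-house, served by rented capacity)."""
    return [
        (min(demand, owned_capacity), max(demand - owned_capacity, 0))
        for demand in hourly_demand
    ]


demand = [40, 55, 120, 90, 60]          # hypothetical units of load per hour
plan = split_load(demand, owned_capacity=80)

rented_hours = sum(1 for _, rented in plan if rented > 0)
print(plan)
print(f"rented capacity needed in {rented_hours} of {len(demand)} hours")
```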

One way of looking at the choice of a data store is to think of data stores as a continuum in terms of latency and the other characteristics of the storage system. We obviously have different needs for different applications. If you think about real-time scenarios, you’re talking about something that has a read-write latency in the tens of milliseconds. In other scenarios you want to push the computation to the data, and it becomes more of a computational need; that’s when you need a data store that can satisfy that need. You have all these different use cases and scenarios, and Cassandra fits at least a few of them very well.

In scenarios where we need very low latency reads and writes, we use something like in-memory solutions; in scenarios where we need pretty high write throughput, we write directly to Cassandra. An example where we don’t use Cassandra is when we need both read and write latency to be low along with a certain level of consistency, such as certain counters that need high consistency.
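The rules of thumb above can be written down as a small decision function. This is only a sketch of the reasoning; the `Workload` fields, the store labels, and the example workloads are hypothetical, and a real decision would weigh more factors than four booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    low_latency_reads: bool
    low_latency_writes: bool
    high_write_throughput: bool
    strong_consistency: bool


def choose_store(w: Workload) -> str:
    """Return a hypothetical store label following the rules of thumb in the text."""
    low_latency_both = w.low_latency_reads and w.low_latency_writes
    if low_latency_both and w.strong_consistency:
        return "not-cassandra"      # e.g. counters that need high consistency
    if low_latency_both:
        return "in-memory"
    if w.high_write_throughput:
        return "cassandra"
    return "undecided"              # the text gives no rule for other cases


# Hypothetical workloads, one per rule.
session_state = Workload("session state", True, True, False, False)
device_events = Workload("device events", False, False, True, False)
usage_counter = Workload("usage counter", True, True, False, True)

for w in (session_state, device_events, usage_counter):
    print(f"{w.name}: {choose_store(w)}")
```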

Time series

At a higher level, you can think of the type of data that we have as (in telecom terms) control plane data or data plane data. Control plane data captures the information exchanged between network elements. Then you have the actual data sent from the device to a backend system or to another device, which we call the data plane.

We have a type of data that represents what is happening in the control plane, which we think of as metadata. And then you have the actual data from the customer or from the device that flows across the network.
Of these two classes of data, one is more like metadata and the other is the actual data from the devices. These are the two classes of data we are looking at.

Words of wisdom

This is actually the second company that I’ve helped bring Cassandra to, and one of the big lessons from the first company is that it’s actually difficult to take engineers who are really oriented around relational databases and get them working in Cassandra and thinking in a non-relational way. Taking a hardcore Oracle engineer and turning that person into a Cassandra engineer is probably more challenging than taking a Java engineer and having them work on Cassandra.

The other thing is that Cassandra is not the answer to everything. You really have to look at exactly what your use case is and figure out whether Cassandra is the right answer for the questions that you’re asking.

We are using Cassandra in a bunch of areas. Anywhere that we’re storing a relatively large amount of data, especially if there is a relatively heavy write aspect to that data, Cassandra is really good. It is especially good for time series, and the TTL (time to live) aspect for time series is fantastic; we find that it actually automates a lot of the operational needs.
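To show why a TTL removes operational work, here is a toy in-memory model of a time series in which each write carries its own expiry. In CQL a TTL is attached at write time with `USING TTL <seconds>`; the `TtlSeries` class below is not Cassandra's API, only a hypothetical sketch of the behavior, and the readings are made up.

```python
from dataclasses import dataclass, field


@dataclass
class TtlSeries:
    """Toy time series where every point expires a fixed time after it is written."""
    points: list = field(default_factory=list)   # (written_at, value, expires_at)

    def write(self, written_at: float, value: float, ttl_seconds: float) -> None:
        self.points.append((written_at, value, written_at + ttl_seconds))

    def read(self, now: float) -> list:
        """Return (written_at, value) pairs that have not expired at `now`."""
        return [(t, v) for t, v, expires_at in self.points if expires_at > now]

    def compact(self, now: float) -> int:
        """Physically drop expired points and return how many were removed."""
        live = [p for p in self.points if p[2] > now]
        removed = len(self.points) - len(live)
        self.points = live
        return removed


series = TtlSeries()
series.write(written_at=0, value=20.5, ttl_seconds=60)    # hypothetical readings
series.write(written_at=30, value=21.0, ttl_seconds=60)

print(series.read(now=45))      # both points are still live
print(series.read(now=70))      # the first point has expired
print(series.compact(now=100))  # number of expired points dropped
```

No purge job appears anywhere in the calling code: old readings stop being visible once their TTL passes, and cleanup happens as a background concern of the store.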

We’re using Cassandra in some of our core 4G telecom network elements. We’re using it from a reporting perspective, as a simple kind of reporting repository. We actually built a horizontally scalable machine search infrastructure on top of Cassandra. And we’re using it to build out the storage of device data for our platform as a service for the Internet of Things.

There is a huge class of data where the primary aspects are high write throughput and access via time series. These are directly in Cassandra’s sweet spot.
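A common way to model this kind of data is to append each reading to a partition keyed by device and time bucket, so writes never touch existing rows and a read fetches one time-ordered slice. The sketch below models that in memory; the one-day bucket, the `BucketedStore` class, and the device IDs are assumptions for illustration, not a description of the Aeris schema.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(device_id: str, ts: datetime) -> tuple:
    """Bucket by device and UTC day so no single partition grows without bound.

    `ts` is assumed to be a timezone-aware datetime.
    """
    return (device_id, ts.astimezone(timezone.utc).strftime("%Y-%m-%d"))


class BucketedStore:
    """Toy append-only store with one list of readings per (device, day) partition."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def write(self, device_id: str, ts: datetime, value: float) -> None:
        # Append only: a write never reads or rewrites existing rows.
        self._partitions[partition_key(device_id, ts)].append((ts, value))

    def read_day(self, device_id: str, day: str) -> list:
        """Return one device's readings for one UTC day, ordered by time."""
        rows = self._partitions.get((device_id, day), [])
        return sorted(rows, key=lambda row: row[0])


store = BucketedStore()
utc = timezone.utc
store.write("truck-17", datetime(2014, 3, 1, 12, 0, tzinfo=utc), 64.2)
store.write("truck-17", datetime(2014, 3, 1, 9, 30, tzinfo=utc), 63.8)
store.write("truck-17", datetime(2014, 3, 2, 0, 5, tzinfo=utc), 65.0)
store.write("sensor-4", datetime(2014, 3, 1, 9, 45, tzinfo=utc), 0.71)

for ts, value in store.read_day("truck-17", "2014-03-01"):
    print(ts.isoformat(), value)
```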
