December 26th, 2013



“The complexity of sharding and ensuring the availability of vast amounts of data was impractical. When evaluating systems capable of persisting this data, only Cassandra provided consistency.”

-Dave Cowen, Systems Engineer at Lucid

Dave Cowen, Systems Engineer at Lucid

Tom Geer, Chief Architect at Lucid


What does Lucid do and what is your role there?

Lucid is a pioneer in the concept of bringing real-time energy and water use feedback to building managers and tenants via connected devices. We recently launched BuildingOS, the world’s first online operating system for buildings, which aggregates energy and water data from over 150 metering and building systems into one unified, network-connected source.

Our Building Dashboard product is the gold standard for user-friendly energy dashboards. It not only presents energy utilization and building performance data but also facilitates energy reduction competitions.

I’m Dave Cowen, a Systems Engineer for Lucid, and I’m joined by Tom Geer, our Chief Architect.


How are you using Apache Cassandra?

We’re using Apache Cassandra 1.2.11 to warehouse all of the energy, weather, profile and statistical data for our clients. One of the big differentiators of Lucid’s products is that we collect real-time data not only from a wide variety of building automation systems but also from site- and energy-specific metering and logging systems. The warehouse also stores historical energy utilization and weather data. This data is aggregated for display in a friendly, easy-to-read interface, and it is also used to create analytical profiles and statistical models of total energy demand, consumption and cost for each customer’s arbitrary partitioning of their property portfolio.


What was the motivation for using Cassandra and what other technologies was it evaluated against?

An efficient application of Brewer’s CAP theorem was the focal point. Because we analyze and aggregate data in real time, consistency was critical. Given the issues with retention and availability of meter data at the source, availability was a close second. Partition tolerance was the final, necessary piece of the puzzle.

Due to the number of meters supported, their reporting frequency, and the fact that we store all of each building’s available energy use history, a traditional relational database would not suffice. The complexity of sharding and ensuring the availability of vast amounts of data was impractical. When evaluating systems capable of persisting this data, only Cassandra provided consistency. We do also use an RDBMS for managing metadata and its relations.


Can you share some insight on what your deployment looks like?

The nature of the data we collect, combined with the processing demands of our product, requires a high level of I/O throughput, making it more economical to invest in our own infrastructure than to deploy in the cloud. Our production C* cluster currently consists of 8 nodes running on modern commodity hardware from iXsystems, with persistence on local SSD-based RAID. We run at a replication factor of 5, allowing us to lose any two nodes and still maintain quorum.
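To make the quorum arithmetic concrete, here is a minimal sketch of the calculation. It assumes the standard Cassandra definition of quorum as a majority of replicas (replication factor divided by two, rounded down, plus one); the function names are hypothetical and used only for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerable_replica_losses(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


rf = 5
print(quorum(rf))                    # 3
print(tolerable_replica_losses(rf))  # 2
```

With a replication factor of 5, every piece of data lives on five nodes, so losing any two nodes removes at most two replicas of it and leaves the three needed for quorum.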


What advice do you have for those just getting started with Cassandra?

When planning your initial deployment, over-plan. Changing the replication factor in a production environment is complicated and time-consuming. Set the initial replication factor to a level that will be acceptable in your future environment and deploy enough nodes to support it. While it may seem like overkill, your data will grow to meet the need.


When allocating storage resources to Cassandra, it’s essential to leave enough overhead for repair and cleanup operations, which often temporarily double storage requirements in order to complete successfully. Also, even when not using DataStax Enterprise, the DataStax documentation on Cassandra is a godsend.
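The storage headroom advice can be turned into a rough sizing estimate. The sketch below is only a back-of-the-envelope calculation: it assumes data is spread evenly across nodes, ignores compression and compaction strategy, and uses a hypothetical function name and an illustrative 1,000 GB dataset that does not come from our deployment.

```python
def per_node_disk_gb(dataset_gb: float,
                     replication_factor: int,
                     node_count: int,
                     overhead_factor: float = 2.0) -> float:
    """Estimate the disk needed per node, including maintenance headroom.

    overhead_factor=2.0 reflects the rule of thumb that repair and cleanup
    can temporarily double storage requirements (an assumption, not a
    guarantee).
    """
    if node_count < replication_factor:
        raise ValueError("need at least as many nodes as replicas")
    replicated_gb = dataset_gb * replication_factor  # total across the cluster
    return replicated_gb / node_count * overhead_factor


# Illustrative only: 1,000 GB of unique data, replication factor 5, 8 nodes.
print(per_node_disk_gb(1000, 5, 8))  # 1250.0
```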
