NREL is the U.S. Department of Energy’s primary national lab for renewable energy and energy efficiency research and development. Most of us are located at this large campus here in Golden, Colorado. My job is to lead a group that does support for analysis in the areas of software development and geospatial data. We build applications, websites, data sites, and much more. The DataBus project is one of the projects that my group is working on.
NREL is an exciting project. We’re in a building here called the Research Support Facility. It’s a very large building (only a few years old) and it was equipped with smart meters of every imaginable kind. We live in a laboratory. For example, when we’re sitting at our desks, a sensor measures our power usage.
The idea for our project is to create smart buildings with sensors, take readings from all of those sensors and then do something with that data. Right now, we aggregate all of the data into our system and then analyze it with various dashboards.
You can see if you’re exceeding expected usage and change the behavior of the staff or the equipment that you have running based on that insight. In our case, we have renewable energy resources (solar panels on our roofs and things like that) and our sensors collect data on what’s coming in. We can see how those resources are impacting the overall energy balance.
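As a rough illustration of the "exceeding expected usage" check described above, here is a minimal Python sketch. It assumes readings arrive as (zone, watts) pairs and that an expected budget exists per zone; the zone names, numbers, and function names are all hypothetical and are not taken from DataBus.

```python
from collections import defaultdict


def total_usage_by_zone(readings):
    """Sum power readings (watts) per zone.

    `readings` is an iterable of (zone, watts) pairs (a hypothetical format).
    """
    totals = defaultdict(float)
    for zone, watts in readings:
        totals[zone] += watts
    return dict(totals)


def zones_over_budget(totals, expected):
    """Return, sorted by name, the zones whose total exceeds the expected usage.

    Zones with no expected value are never flagged.
    """
    return sorted(
        zone for zone, watts in totals.items()
        if watts > expected.get(zone, float("inf"))
    )


# Made-up sample data for illustration only.
readings = [("floor-1", 120.0), ("floor-1", 95.5), ("floor-2", 300.0)]
expected = {"floor-1": 250.0, "floor-2": 280.0}

totals = total_usage_by_zone(readings)
over = zones_over_budget(totals, expected)
print(over)
```

A dashboard could run a check like this on each aggregation window and highlight the flagged zones.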
The Research Support Facility is LEED certified, so we’re one of the greenest buildings in the world. We’re very proud of that and we want to make sure that we keep it up. We get a lot of insight from this data that we collect. For example, if one of our sensors is failing or having unreasonable readings, we can monitor that behavior and fix our sensors/meters when they are broken.
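One simple way to spot a failing sensor could look like the sketch below. It flags a sensor when any reading falls outside a plausible range, or when every reading is identical (a "flatlined" meter). The thresholds, sensor IDs, and function name are assumptions made for illustration, not the checks DataBus actually runs.

```python
def find_suspect_sensors(readings_by_sensor, low, high, min_samples=3):
    """Map each suspect sensor ID to the reason it was flagged.

    A sensor is "out_of_range" if any value lies outside [low, high], and
    "flatlined" if it has at least `min_samples` values that are all identical.
    """
    suspects = {}
    for sensor_id, values in readings_by_sensor.items():
        if any(v < low or v > high for v in values):
            suspects[sensor_id] = "out_of_range"
        elif len(values) >= min_samples and len(set(values)) == 1:
            suspects[sensor_id] = "flatlined"
    return suspects


# Made-up sample data for illustration only.
sample = {
    "meter-a": [410.0, 415.2, 409.8],  # plausible readings
    "meter-b": [400.0, -5.0, 402.1],   # a negative reading is out of range
    "meter-c": [0.0, 0.0, 0.0],        # never changes
}
suspects = find_suspect_sensors(sample, low=0.0, high=5000.0)
print(suspects)
```

A real deployment would likely need per-sensor thresholds, since a sensible range for a power meter differs from that of a carbon dioxide sensor.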
We looked a lot at MongoDB and they actually visited us a few times. I dabbled in it but it boils down to the operational side of things. The development side on MongoDB has always been really nice and you can get started quickly, but as soon as you hit operations, there’s a lot of overhead that Cassandra doesn’t have. Once you set up Cassandra properly, a lot of it is automated for you; we’ve even had a hardware outage without taking a hit.
In the past, I’ve used Hadoop. At one point, it had a single point of failure, but I believe that has been resolved. Overall, Hadoop was a bit more complex and it had a very low column family count. I don’t know if that’s still true, but I didn’t really feel like revisiting that problem at the time.
I have previously used MySQL, Oracle, Sybase, MSSQL… you name it and I’ve probably used it. The first things I dabbled with in the NoSQL space were GemFire and Hadoop. Later in my career, I was brought on to a different Cassandra project. It was much easier to get started because I had been using NoSQL for a while.
We’ve written DataBus with Cassandra (and client drivers as well). We only have ~300 events per second, but we’ve tested the scalability of the entire system for future expansion. We have frameworks in place that help us and our community members visualize both our internal data and the data of community members who are working on the open source project. Members of the community can actually donate their own versions of charts, etc.
They have over 80,000 sensors here running 24/7; we’re reading them constantly and shoving all of that data into DataBus. They monitor everything… for instance, this room we’re in right now measures the carbon dioxide percentage, allowing us to determine the occupancy in meetings. It’s been quite unique coming into a research facility and seeing how much they’re measuring. That’s the DataBus project right there in a nutshell: measuring as much as possible and doing analysis on all of that data.
We’re in NREL’s data center right now and we currently have 12 nodes, 4 web servers and 1 load balancer. The load balancer is our only single point of failure, but thankfully it hasn’t failed yet. We could obviously do a virtual IP with a second load balancer, but it hasn’t really been warranted.
We’re constantly doing rolling upgrades on the web servers. We recently tested a rolling upgrade on the Cassandra nodes, but you have to upgrade to a certain version first. We ended up just skipping a lot of versions and taking the whole cluster down.
We run on 16GB RAM machines and we’re probably going to scale down our machines. The original ones they purchased before I showed up were geared for 20TB drives. When I started here, I unfortunately had to tell them that the other 6 machines have 19TB wasted.
One thing to watch out for if you’re new to Cassandra is your hardware. Many times I’ve come into an organization and they’ve bought these really expensive machines with 20TB of disk. You want to buy the cheap low-end machines… we even use VMs for our web servers right now.
One of the things that we did get burned by (and have since fixed) is very narrow tables. If you’re not careful, narrow tables will kill you; wide tables work much more efficiently. Memory management is another one… people aren’t used to that when first getting into Cassandra. It’s scary in those respects, but mostly automated once you get control of the system.
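To make the narrow-versus-wide distinction concrete, here is a minimal Python sketch of one common time-series pattern: instead of one tiny row per reading, many readings are grouped under a single partition key (here, sensor ID plus calendar day) and kept in time order. This is a general modeling pattern shown with plain dictionaries, not DataBus's actual schema; the key layout, sensor IDs, and function names are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(sensor_id, ts):
    """Bucket readings by sensor and calendar day (hypothetical key layout)."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))


def build_wide_rows(readings):
    """Group (sensor_id, timestamp, value) readings into wide rows.

    Each wide row maps timestamp -> value, ordered by timestamp, which mimics
    time-ordered columns stored under a single partition.
    """
    rows = defaultdict(dict)
    for sensor_id, ts, value in readings:
        rows[partition_key(sensor_id, ts)][ts] = value
    return {key: dict(sorted(cols.items())) for key, cols in rows.items()}


# Made-up sample data for illustration only.
utc = timezone.utc
readings = [
    ("meter-a", datetime(2013, 5, 1, 12, 0, 5, tzinfo=utc), 101.0),
    ("meter-a", datetime(2013, 5, 1, 12, 0, 0, tzinfo=utc), 100.0),
    ("meter-a", datetime(2013, 5, 2, 0, 0, 0, tzinfo=utc), 99.0),
]
wide_rows = build_wide_rows(readings)
for key, cols in wide_rows.items():
    print(key, len(cols))
```

With this layout, reading a day of data for one sensor touches a single row rather than thousands of narrow ones. The right bucket size depends on how often each sensor reports.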
The data modeling is definitely different. We use an open source project called PlayORM, although it’s not really object-relational mapping, despite the “ORM” in the name. That’s helped us a lot for storing some of our metadata and relational data. With Cassandra, it’s easy just to shove it in there and get it out.
Be sure to check out GCN’s article, “NREL releases free, open-source energy analysis tool,” to learn more about Energy DataBus.
Access to the DataBus repository is available from NREL.