October 11th, 2013

Hari Krishnan: Architect at PROS

Christian Hasker: Editor at Planet Cassandra, A DataStax Community Service

 

TL;DR: PROS is a Big Data software company whose prescriptive analytics software helps its customers analyze their data and gain the insights and guidance to optimize their pricing, sales, and revenue management.

 

PROS uses Cassandra as a distributed database for low-latency, high-throughput services that handle real-time workloads comprising hundreds of updates per second and tens of thousands of reads per second. For example, they have a real-time service that computes airline availability dynamically, taking into consideration revenue-control data and inventory levels that can change many hundreds of times per second. This service is queried several thousand times per second, which translates to tens of thousands of data lookups. The backend storage layer for this service is Cassandra.

 

For their real-time solution, PROS realized a need for a distributed cache that is highly available and easily scalable, has a masterless architecture, replicates data in near real time even across data centers, and can handle real-time reads and writes. PROS evaluated Cassandra against Oracle Berkeley DB, Oracle Coherence, Terracotta, Voldemort, and Redis. Apache Cassandra quite easily topped the list.

 

Hari, thank you for joining us. Could you start off by telling us a little bit about what PROS does and what your role is there?

PROS is a Big Data software company. The Big Data science and the prescriptive analytics in our software help our customers analyze their data and gain the insights and guidance to optimize their pricing, sales, and revenue management. The software stack at PROS is a combination of heavy backend jobs and real-time applications and services.

 

We have completed over 600 implementations of our solutions in more than 50 countries, across more than 30 industries. Our main industry verticals at the moment are manufacturing, distribution, services, and travel. I’m an architect at PROS. My primary focus areas are real-time and Big Data architectures and solutions.

 

So your platform optimizes sales for many industries, but you’re focused on the travel industry. What kind of data are you analyzing? Is it a type of recommendation engine for sales?

We analyze sales transactions. On the travel side you can think of these as bookings, and on the B2B side they are sales transactions. We apply data science to historical data to come up with prescriptive analytics and guidance on pricing, sales, and revenue controls.

 

Okay, great. Could you talk a little bit about how you are using Cassandra: which applications use it and what it’s doing for them?

We use Cassandra as a distributed database for low-latency, high-throughput services that handle real-time workloads comprising hundreds of updates per second and tens of thousands of reads per second. For example, we have a real-time service that computes airline availability dynamically, taking into consideration revenue-control data and inventory levels that can change many hundreds of times per second. This service is queried several thousand times per second, which translates to tens of thousands of data lookups. Our backend storage layer for this service is Cassandra.

                      

Some of our SaaS offerings use Cassandra as the backend store to handle a combination of real-time and Hadoop-based batch workloads.

 

So, talking about Hadoop and Cassandra side by side: are you taking the data out of Cassandra, putting it into Hadoop, and running batch analytics on it, and does that then go back into Cassandra? Can you talk a little bit about that?

Yes, we use Cassandra’s Hadoop integration. Our Hadoop jobs pull data out of Cassandra, apply job-specific transformations or analysis, and push the data back into Cassandra. We are not using the DataStax Enterprise edition for this integration; just the open-source Hadoop installation with Cassandra.
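The pull-transform-push cycle described here can be sketched in miniature. This is an illustrative Python sketch, not PROS’s actual job code: plain Python structures stand in for the Cassandra input and output formats, and all names (`transform`, `run_job`, the sample rows) are hypothetical.

```python
# A minimal, self-contained sketch of a Hadoop-style pull-transform-push
# cycle. In a real job the rows would be read from Cassandra via its Hadoop
# input format and written back via the output format; here the I/O is
# stood in by plain dicts, and the transformation is a made-up aggregation.

def transform(row):
    """Hypothetical job-specific transformation: total the bookings per key."""
    return {"key": row["key"], "total": sum(row["bookings"])}

def run_job(source_rows):
    """Pull each row, apply the transformation, and collect the output rows."""
    return [transform(row) for row in source_rows]

# Stand-in for rows pulled out of Cassandra.
rows = [
    {"key": "flight-101", "bookings": [3, 5, 2]},
    {"key": "flight-202", "bookings": [1, 4]},
]

results = run_job(rows)
# In the real pipeline, `results` would then be pushed back into Cassandra.
```

The shape is the same regardless of the actual analysis: read, transform, write back, with the job-specific logic isolated in a single function.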

 

Okay, great. As you talked about your use case, it’s obviously a great fit for Cassandra’s strengths. How did you come to learn about Cassandra and choose it?

Back in 2010, when we were enhancing the functional and non-functional capabilities of our real-time solution, we realized a need for a distributed cache that is highly available and easily scalable, has a masterless architecture, replicates data in near real time even across data centers, and can handle real-time reads and writes. We evaluated Cassandra against Oracle Berkeley DB, Oracle Coherence, Terracotta, Voldemort, and Redis. In addition to all of the technical criteria that I already mentioned, we also used cost, ease of use, and the project’s community backing as important evaluation criteria. Apache Cassandra quite easily topped the list. The choice has worked out quite well for us so far. We found it to be very versatile: a solid key-value store as well as one with rich data modeling capabilities.

 

Apache Cassandra is evolving fast, and we are learning and understanding its capabilities, especially on the data modeling side. We see it as our distributed NoSQL database of choice for our Big Data services and solutions.

 

You’ve talked about data modeling; what was your background? Did you come from a relational background, and how was it picking up the data model? What kind of advice would you pass along to other people looking to do the same thing?

I come from a relational as well as a key-value NoSQL background. We were looking to replace a key-value store with something more capable at real-time replication and data distribution. We read up quite a bit on Dynamo, the CAP theorem, and the eventual consistency model. Cassandra fit this model quite well, and moving to Cassandra was fairly straightforward and technically easy for us. As we learned more about its data modeling capabilities, we gradually moved towards decomposing data. We started using composite columns and wide-row capabilities quite heavily.
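As a rough illustration of the wide-row and composite-column pattern mentioned here: in Cassandra, the row key identifies a partition and each composite column names a cell within it, with cells kept sorted by their column components, which makes ordered column-slice queries cheap. The sketch below simulates that with a dict of sorted lists; the schema, names, and data are hypothetical, not PROS’s actual model.

```python
# Simulating the wide-row pattern: one partition (row key) holds many cells,
# each addressed by a composite column kept in sorted order, so a range of
# columns can be sliced out efficiently. A dict of sorted lists stands in
# for Cassandra's storage here; names and data are purely illustrative.
from bisect import insort

wide_rows = {}  # row key -> sorted list of (composite column, value)

def insert(row_key, composite_col, value):
    """Add one cell to a wide row, keeping the columns sorted."""
    insort(wide_rows.setdefault(row_key, []), (composite_col, value))

def slice_row(row_key, start, end):
    """Column-slice query: all cells with start <= composite column < end."""
    return [(c, v) for c, v in wide_rows.get(row_key, [])
            if start <= c < end]

# One partition per flight-day; composite columns order cells by hour, then cabin.
insert("AA100:2013-10-11", (9, "cabin-Y"), 42)
insert("AA100:2013-10-11", (12, "cabin-Y"), 37)
insert("AA100:2013-10-11", (15, "cabin-J"), 5)

morning = slice_row("AA100:2013-10-11", (0,), (12,))
```

The key design move is the same one wide rows enable in Cassandra: data that a relational model would spread across many rows is decomposed into ordered cells of a single partition, so a related range can be read in one contiguous slice.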

 

My advice would be: if you’re coming from a relational database background with strong ACID semantics, take the time to understand the eventual consistency model. Understand Cassandra’s architecture very well and what it does under the hood. With Cassandra 2.0 you get lightweight transactions and triggers, but they are not the same as the traditional database transactions one might be familiar with. For example, there are no foreign key constraints available; those have to be handled by your application. Understand your use case and data access patterns clearly before modeling data with Cassandra. Read all the available documentation.
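The eventual-consistency model worth understanding here has a simple arithmetic core: with replication factor N, a read from R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. That is the rule behind Cassandra’s tunable consistency levels such as QUORUM; the sketch below only illustrates the arithmetic and is not driver API.

```python
# Replica-overlap arithmetic behind tunable consistency: with replication
# factor n, any read of r replicas must intersect any write of w replicas
# when r + w > n, so such a read is guaranteed to observe the latest write.

def overlap_guaranteed(n, r, w):
    """True if every r-replica read overlaps every w-replica write."""
    return r + w > n

N = 3                  # replication factor
QUORUM = N // 2 + 1    # a majority: 2 of 3 replicas

strong = overlap_guaranteed(N, QUORUM, QUORUM)  # QUORUM reads and writes
weak = overlap_guaranteed(N, 1, 1)              # ONE/ONE reads and writes
```

At QUORUM/QUORUM on three replicas the overlap holds (`strong` is true); at ONE/ONE it does not (`weak` is false), which is exactly the regime where stale reads become possible and the application must tolerate eventual consistency.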

 

So, you’ve mentioned lightweight transactions and triggers. Are you on 2.0, or are you looking at it?

Yes, we are beginning to experiment with those features. At the moment, we are still on 1.2. We use wide rows quite a lot, and there are nice optimizations in 2.0 for row caching with wide rows. A lot of our applications use transactions, and the capabilities offered by 2.0 will be quite useful.

 

Okay, excellent. Can you talk a little bit about the environment that’s running Cassandra: hardware, number of nodes, stuff like that?

Sure. Most of our current production environments are hosted on premises in our customers’ data centers. The customer deployments, of course, vary quite a bit based on their requirements. At the moment, all of our deployments are on Linux x86-64 servers ranging anywhere from 8 to 24 cores. Some deployments use SSDs for fast access while others use spinning disks. Some of our deployments are across three data centers using as many clusters, while others are across two data centers using a single cluster, and the most common, of course, is a single data center with a single cluster. The number of nodes varies anywhere between three and ten per data center, and the data sizes per data center are typically about 100 gigabytes.

                   

We are beginning to deploy on Windows Azure for our SaaS offerings. These deployments will deal with very large distributed datasets. We are working through some platform limitations to support multi-data center deployments.

 

Okay, brilliant. Anything else that you’d like to add? This has been fantastic, thank you very much.

Sure. I think as we start looking at different cloud offerings and platforms such as Google Compute Engine and Windows Azure, we would like to see Cassandra’s support for the major cloud platforms evolve.

                       

We deploy across multiple data centers that typically don’t open up to each other. We have found Cassandra’s need to completely open up nodes across all data centers to be quite limiting. That restricts us to deploying multiple rings, which have to be kept in sync. The synchronization mechanism is something that we have to develop and manage ourselves at the moment, but I’m hoping something will evolve in this area as well. Having a single master data center and multiple read-only slave data centers is quite common. Redis does a good job of one-way replication in these scenarios. Perhaps the solution in this case is a hybrid Cassandra-Redis deployment.

                        

Lastly, I really want to point out the Apache Cassandra community itself; Planet Cassandra is excellent. It is a great source of information online that aggregates content from Stack Overflow, Twitter, and SlideShare. The DataStax documents are fantastic; I highly recommend that everyone take a good look at them and read through the documentation. I must also point out the #cassandra IRC channel on freenode. It was very useful for us while we were writing a new Snitch implementation to support a ring across NAT-NAT networks. So, thank you.
