October 1st, 2013

“There’s little worry about scaling with Cassandra because it has been shown to scale linearly with writes and reads up to hundreds of nodes. It is a great technology.”

-John Berryman, Search Architect at Open Source Connections


Hi. I’m joined today by John Berryman of OpenSource Connections for this five-minute interview. Hey, John, why don’t you start us off by telling us a bit about OpenSource Connections and what you do over there?

Historically we have been primarily a search and more specifically a Solr consultancy so we help people get up and running with the Solr search engine and make sure that their searches are returning relevant results. But more and more these days we find ourselves dealing with big data applications. And a very important part of that ecosystem is Cassandra.  Right now we’re working with a client who is using Cassandra to replace SQL-based solutions with something that scales better and is more appropriate for the task. OpenSource Connections is also very interested in the potential intersection between Cassandra and Solr itself.


You’re in a pretty privileged position to maybe comment on the NoSQL landscape.  Do you support other things in addition to Cassandra and what else do you see out there?

We are kind of at the nexus of NoSQL and Big Data: Solr itself is kind of a NoSQL solution. Cassandra is NoSQL. Hadoop with HBase is a different variant of NoSQL. Right now, we’re excited to see that there’s interest in having all three of these things work together. DataStax notably has interesting solutions for the combination of Cassandra, Hadoop, and Solr, so it’s a very interesting ecosystem right now.


Let’s dive into Cassandra specifically. When you’re working with clients, what are the use cases that you see that are the best fit for Apache Cassandra?

Well, take our current client, who is replacing part of their SQL solution with Cassandra. We want to make sure that they understand where SQL is good and where SQL falls down. With SQL you can do some pretty arbitrary queries, which allows for good data analytics, but a lot of times we’re seeing clients outgrowing SQL. If you have got time series data, if you’ve got a lot of updates coming in, then you really need a different solution, and Cassandra is really good here because it is so very fast at writes. As long as you have a clear idea about how you want to query the data (and thus how to write the data to Cassandra), then you can gobble down as many writes as you can push to Cassandra. There’s little worry about scaling with Cassandra because it has been shown to scale linearly with writes and reads up to hundreds of nodes. It is a great technology.
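To make the "know your query first" point concrete, here is a minimal in-memory sketch in Python. It is not Cassandra and not the client's design; the class name, the (sensor_id, day) partitioning, and the method names are all assumptions chosen for illustration. The idea it shows is that writes are laid out in the shape of the one query you expect to run (readings for one sensor within one day, in time order), so the read is a simple sorted slice.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class TimeSeriesSketch:
    """Toy stand-in for a query-first time series table (hypothetical, not Cassandra)."""

    def __init__(self):
        # (sensor_id, day) plays the role of a partition key;
        # each partition keeps its readings sorted by timestamp.
        self.partitions = defaultdict(list)

    def write(self, sensor_id, ts, value):
        # Route the reading to its partition and keep the partition sorted.
        key = (sensor_id, ts.date())
        bisect.insort(self.partitions[key], (ts, value))

    def read_range(self, sensor_id, start, end):
        """Return readings with start <= ts < end from the partition for start's day."""
        rows = self.partitions.get((sensor_id, start.date()), [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]


store = TimeSeriesSketch()
store.write("s1", datetime(2013, 10, 1, 12, 0), 2.0)
store.write("s1", datetime(2013, 10, 1, 9, 0), 1.0)
morning = store.read_range("s1", datetime(2013, 10, 1, 0, 0), datetime(2013, 10, 1, 12, 0))
print(morning)
```

A query that does not match this layout (say, "all sensors above some value") would have to scan every partition, which is the trade-off being described: fast writes and fast planned reads in exchange for giving up arbitrary queries.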


And do you help your customers transition that mindset from SQL to Cassandra as well and how is that transition?  Do you help with the data model and things like that?

Absolutely. With Cassandra 2, and I guess it’s available even in 1.2 and 1.1, you have CQL, which makes Cassandra much more approachable for those coming from a SQL background. CQL is very similar to SQL, so it’s an easier pill for people to swallow. I really like working with a client to help them understand Cassandra’s internal data model, the places where Cassandra excels, and how CQL maps to the internal data model.
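As a rough picture of what "how CQL maps to the internal data model" can mean, the Python sketch below models the mapping for a simple CQL3 table of that era: the partition key becomes the storage row key, and each CQL row becomes a group of cells whose names are prefixed by the clustering value, plus an empty-named marker cell. This is a simplified illustration under those assumptions, not Cassandra's code; the function name, the example table, and the column names are hypothetical.

```python
from collections import OrderedDict

# Hypothetical table this sketch has in mind:
#   CREATE TABLE readings (sensor_id text, ts int, temp double,
#                          PRIMARY KEY (sensor_id, ts));


def to_storage_rows(cql_rows, partition_key, clustering_key):
    """Model CQL rows as wide storage rows keyed by partition key (simplified)."""
    storage = {}
    ordered = sorted(cql_rows, key=lambda r: (r[partition_key], r[clustering_key]))
    for row in ordered:
        wide_row = storage.setdefault(row[partition_key], OrderedDict())
        ck = row[clustering_key]
        # Marker cell: records that the CQL row exists even if all other columns are null.
        wide_row[(ck, "")] = ""
        # One cell per remaining column, named (clustering value, column name).
        for name in sorted(row):
            if name not in (partition_key, clustering_key):
                wide_row[(ck, name)] = row[name]
    return storage


rows = [
    {"sensor_id": "s1", "ts": 2, "temp": 21.5},
    {"sensor_id": "s1", "ts": 1, "temp": 20.0},
]
for cell_name, value in to_storage_rows(rows, "sensor_id", "ts")["s1"].items():
    print(cell_name, value)
```

Both CQL rows land in the single storage row "s1", sorted by ts, which is why range queries on the clustering column within one partition are cheap.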


And at OpenSource Connections you’ve been a great friend of the Cassandra community and been involved in driving attendance to summits and helping with meet-ups. Dealing with different technologies, you participate in many communities; what do you like about working with and interacting with the Cassandra community?

It’s a neat community. There are a lot of people out there who are very eager to help. For instance, I’ve hung out a lot on the IRC, and there are a lot of people like Rob Coli, Russ Bradberry, and Aleksey Yeschenko who are always there, willing to help you with your questions. It’s a really tight-knit community, so I like working with everyone.


So lastly, John, what advice would you give to someone looking to make this transition from a relational-only world to picking up Apache Cassandra?

The answer to any sufficiently difficult question like this is always ‘it depends’. It’s good to understand what each technology is good at. The place where Cassandra excels is with time series stuff. Cassandra is excellent when you understand up front how the data will be queried. And Cassandra is good when you know that the data set is going to quickly outgrow a single server. When approaching a big data problem, it’s good to explore around a bit so that you have an understanding of which technologies are good at what. We now live in a world where there is a great palette of possible things to choose from, and Cassandra is certainly an important tool in the big data palette.


And do you find most of your customers use Cassandra in the cloud in Amazon or are you seeing them using it in physical data centers or sort of a hybridized approach?

So far it’s been in physical data centers.  If the servers are downstairs in the basement, you have more control over the exact hardware you use.  The earlier concern with cloud was that you don’t want to be shoveling your data to a hard drive that’s in a different box than the one that’s been doing the processing.  However, I suspect as we go forward this will be less and less of a concern and we’ll see a variety of physical and cloud datacenters.


Thank you very much, John. You’ve written some great posts on your blog that are valuable resources for people; you can find one of them here: http://www.opensourceconnections.com/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure-sets-lists-and-maps/


You’ll also be presenting this topic at a webinar in November and people can register for it here: http://learn.datastax.com/WebinarUnderstandingHowCQL3MapstoCassandrasInternalDataStructure.html