May 31st, 2013

“Another approach we can use with Cassandra, or more specifically the DataStax Enterprise edition, is to hook up Solr directly; there’s no bridge required. The work-flow dumps results into Cassandra and, boom, they have an index.”

-Ken Krugler, President at Scale Unlimited

This is Christian Hasker, and welcome to another edition of the Cassandra Community Podcast’s five-minute interviews. Today, I am delighted to have with me Ken Krugler, President of Scale Unlimited. Welcome, Ken.

Thanks.

Can you tell us what Scale Unlimited does and your involvement in Cassandra?

We do consulting and training, primarily around big data. A lot of it is custom work-flow development for customers, along with training, as well as search; most of our search work is Solr/Lucene-based.

What do you do around Cassandra? Within the community, we see a lot of members using Solr/Lucene for search; could you touch on this as well?

When we need a NoSQL data store, a real one (we’re not just dumping data into a file somewhere), then we use Cassandra. We use both Cassandra and HBase; in general, Cassandra’s been easier to set up and get going. I’m often the first line of defense: I’ll go into a company that typically lacks an experienced ops team, so Cassandra’s ease of operation is a big win for me.

Typically Cassandra is used for persisting data when the client doesn’t want to use Oracle, or some other traditional database, because they worry about scalability, performance, or cost. When search is required as one of the outcomes of a work-flow, we can send the data directly into Solr; this helps us generate indexes really quickly using a Hadoop cluster.

Another approach we can use with Cassandra, or more specifically the DataStax Enterprise edition, is to hook up Solr directly; there’s no bridge required. The work-flow dumps results into Cassandra and, boom, they have an index. The win there for them is that it’s scalable and fault tolerant (because they now have multiple copies of the data), and there’s only one infrastructure stack that they have to maintain, which, for a lot of our smaller clients, is a pretty big deal.
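For readers curious what that “no bridge required” setup looks like, here is a rough sketch of how DSE Search exposes a Solr index over a Cassandra table. The keyspace, table, and field names are hypothetical, and the exact steps vary by DSE version; the point is that once a Solr core is created for the table, writes are indexed automatically and the index can be queried through CQL’s solr_query pseudo-column.

```sql
-- Create a table to receive work-flow output (names are hypothetical)
CREATE TABLE mykeyspace.documents (
    id    text PRIMARY KEY,
    title text,
    body  text
);

-- Enable DSE Search on the table (run once from the shell):
--   dsetool create_core mykeyspace.documents generateResources=true

-- Any write is indexed automatically; no separate bridge or ETL step
INSERT INTO mykeyspace.documents (id, title, body)
VALUES ('doc1', 'Cassandra + Solr', 'Search-enabled NoSQL');

-- Query the Solr index directly through CQL
SELECT id, title
FROM mykeyspace.documents
WHERE solr_query = 'body:NoSQL';
```

Because the same nodes hold both the data and the index, replication gives you the fault tolerance and single-stack maintenance Ken describes.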

 

You actually see things a little differently from how the broader community sees them, which is: usually Cassandra itself is the driver for adoption because of its availability and scalability; then, if developers have a need, they add Solr/Lucene afterwards for search. It sounds like you’re actually seeing search with Solr/Lucene as the primary driver for Cassandra adoption, because combining them creates a good solution for your customers.

Right. I think one of the reasons we see things a bit differently is that oftentimes our clients are smaller. We’re a small consulting company and we work with a lot of startups. Typically they don’t have an existing infrastructure of anything. It’s not like they already have Cassandra and then they’re like, “Can you add a work-flow onto this?” They’re starting from scratch and they’re saying, “I think we need a solution, and it looks like it needs to scale and handle big data.”

From the beginning, we’re creating a work-flow; if one of the components needs to be a NoSQL solution, and/or they need search with the ability to scale and stay reliable, that’s where we can plug in a Cassandra + Solr solution.

Great. We’re nearly out of time, but one last question before you go… you really specialize in work-flow. What are the common mistakes you see out there, or any advice you would give to people that are thinking about putting together a big data solution from scratch?

Probably the single biggest mistake is not focusing on where the input data is coming from. In a typical work-flow project, that’s where you’re going to spend most of your time: getting access to the data that’s flowing into the system. That means doing hand-to-hand combat with the DBAs who manage the Oracle databases, if that’s your original source of data.

Especially when data has to flow between groups within a company, that’s when you run into some serious issues. Oftentimes everyone wants to take part in the work-flow because that’s cool and sexy and interesting, but it’s the grungy details of “how do I get the bytes in there so I can do something with them?” that matter most.

Okay, great. And just real quick: you’re speaking at the Cassandra Summit 2013. Do you want to take 30 seconds and preview your talk?

Sure. I’ll be talking about a project that’s using social media data, voluntarily provided by military personnel, and predictive analytics to try to give psychologists input on who’s most likely to attempt suicide. It’s basically doing some triage to help them focus their efforts on the most at-risk individuals.

Is it going to be a feel-good session, or is it going to bum us all out?

I think it’s going to be a feel-good session. It’s a case of actually using big data for something that feels meaningful; it has real practical implications. It’s a serious problem and it’s an interesting use of Cassandra for predictive analytics.

We’re looking forward to having you there and thank you for joining us for this podcast today.

You’re welcome.