November 4th, 2013

By 

 

Jason Atlas: Vice President of Engineering and Technology at Internet Identity (IID)

 

 

Thanks, Jason, for sharing your use case with our audience. Can you please explain what you folks do at IID?

We help provide defensive and preventative measures to protect our customers against being exploited, being hacked or having their property stolen online. We accomplish that by sharing information, compiling and comparing data, analyzing it and then performing actions on it. We conduct different types of security operations from threat mitigations to tackling more advanced threat investigations and analysis.

 

For more than a decade we’ve focused on mitigating internet threats by working with companies, governments, law enforcement and service providers. We recently expanded operations to include security information sharing and do business with all five of the top financial companies.

 

Do customers purchase your software stack and install it themselves or do you operate that?

It is ours; it is software-as-a-service. Essentially the customer knows what they want us to run, and it’s too complicated for them to execute. We have DNS centers out there; we have collection systems out there and some pretty weird internal hardcore infrastructure elements that collect company data and send it back. We have to register with a lot of people so that they know that we are actually one of the good guys. So we handle all that for our customers.

 

What does your infrastructure look like?

We operate a permutated combination. When our foray began we used what I would consider to be well-aged on-premises technology, dedicated technologies. With the global scale we need to address, we realized our legacy database could not handle the growth, so we looked at DataStax and Cassandra as our first foray into the NoSQL world. Cassandra performed the best during our real-world proof of concept testing and we now have migrated one of our entire products entirely off of the standard relational database and web API stack into the DataStax Cassandra deployment.

 

What were some of the business and technology factors that pushed you in this direction, towards NoSQL and what specifically to Cassandra and DataStax Enterprise?

We migrated for the same reason that a lot of people select NoSQL, which largely deals with scale of data. At a certain point in time relational databases fall down. For example, when you try to stuff a lot of IPv4 data into an RDBMS table and then do a query against it, it has problems. Speed and scale are very critical to what we are doing and we are dealing with very large data sets.

 

The other thing that I very much liked about DataStax, as opposed to just the pure Cassandra model, is its clean integration into MapReduce. We are looking at using that part of the stack as well, for analytics and things like data segmentation and smart caching, as well as the normal heavy computational frameworks. In addition, the OpsCenter console was a massive management piece because it gives us the ability to provide the level of monitoring and maintenance that we normally only see in our business activity monitoring segment.

 

Finally, the cost effectiveness is very compelling compared to relational databases. If I want to cluster MySQL, it’s 30,000 dollars a machine. If I wanted to go with Oracle, you are now talking about 40,000 to 75,000 dollars per core.

 

By contrast, you guys are costing me…well, a lot less. If I were to build my application infrastructure on a relational system, it would require an ungodly amount of resources. In the past I’ve hired a team of 200 developers and we needed 15 DBAs. I have two now. It is a different game from a price policy perspective, as well as everything else like TCO, manageability, etc.

 

You mentioned that you had moved off of a relational stack. Which RDBMS did you leave and which NoSQL vendors did you evaluate during the proofs of concept?

We migrated from a MySQL cluster, and then for NoSQL we looked at Cloudera, Couchbase and MongoDB. I had already played with a number of other ones at previous companies, so those are the ones I evaluated for this particular instance.

 

So, scalability is critical to you. How much data do you handle, and where could it go from here?

Our first product handles about 38 terabytes of data: IPv4 information and all of its associated metadata, stored so that we can retrieve it, generate reports and send out information to people. We are moving into DNS, where we collect every DNS name on the planet and all of its attributes, and we store and process them in a massive feed. Our largest dataset is, I believe, a tenth of a petabyte. I’m looking at my capacity model over the next three years and it leads me up to a half petabyte.
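
As a rough illustration of the capacity figures quoted above (a tenth of a petabyte growing to half a petabyte over three years), the short sketch below works out the yearly growth factor those two numbers would imply. It assumes smooth compound growth, which the interview does not state; the function name and the yearly breakdown are hypothetical.

```python
def implied_annual_growth(start_pb, end_pb, years):
    """Yearly growth factor implied by start/end sizes, ASSUMING compound growth."""
    return (end_pb / start_pb) ** (1 / years)


# Figures from the interview: 0.1 PB today, 0.5 PB in three years.
growth = implied_annual_growth(0.1, 0.5, 3)

# Hypothetical year-by-year sizes (in PB) under the same compounding assumption.
projection = [0.1 * growth ** year for year in range(4)]

print(f"implied growth factor per year: {growth:.2f}")
print("projected size by year (PB):", [round(size, 3) for size in projection])
```

Under that assumption the dataset would grow by roughly 1.7 times per year, which is the kind of curve a capacity model has to absorb by adding nodes.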

 

How else do you use your data?

Our Signal product is used by one of the big three banks. Basically, a company will tell us that they own a certain IP range and want us to notify them immediately if anything bad happens within that range. Using the information we collect, we operate logic within the Cassandra portion of DataStax to constantly filter through data and determine if anything bad is happening.

We also produce a report for governmental agencies, about 27 of them now, that highlights essential IP attack activity of the day and lets these people then feed it into their own analytic systems and act on it. They use the intelligence provided by our system to help their own perimeter defenses and to react and respond.
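
The range-based alerting described for Signal (a customer declares the IP ranges it owns and is notified when something bad happens inside them) can be sketched in a few lines. This is only an illustration using Python's standard `ipaddress` module, not IID's actual logic, which the interview says runs inside Cassandra. The customer name, the event fields, and the helper functions `owners_of` and `alerts_for` are all hypothetical, and the addresses come from reserved documentation ranges.

```python
import ipaddress

# Hypothetical customer-owned ranges (reserved documentation networks, not real data).
WATCHED_RANGES = {
    "example-bank": [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/25"),
    ],
}


def owners_of(ip_text, watched=WATCHED_RANGES):
    """Return the sorted names of customers whose ranges contain ip_text."""
    ip = ipaddress.ip_address(ip_text)
    return sorted(
        customer
        for customer, networks in watched.items()
        if any(ip in network for network in networks)
    )


def alerts_for(events, watched=WATCHED_RANGES):
    """Yield (customer, event) pairs for events that fall inside a watched range."""
    for event in events:
        for customer in owners_of(event["ip"], watched):
            yield customer, event


# Hypothetical threat events, standing in for the collected data feed.
events = [
    {"ip": "192.0.2.17", "kind": "malware-callback"},
    {"ip": "203.0.113.9", "kind": "port-scan"},
    {"ip": "198.51.100.200", "kind": "phishing-host"},
]

alerts = list(alerts_for(events))
for customer, event in alerts:
    print(f"notify {customer}: {event['kind']} at {event['ip']}")
```

Only the first event produces a notification here: the second address belongs to no watched range, and the third falls outside the /25 block. A linear scan like this is fine for a sketch; at the data volumes described in the interview, the lookup would need an indexed or partitioned structure instead.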

 

I would imagine this type of service needs continuous availability.

Absolutely. Right now we are running a version of it on DataStax and we’ve only had one node go down, and we have a feeling that was actually due more to AWS than it was due to Cassandra. We’ve not lost our ability to have constant uptime since we migrated.

 

Do you plan to use the search and/or the Solr features in DataStax Enterprise?

We need to. Now, this isn’t a use case right now exactly but it’s a natural use case; externally no one can query the IP system in real time and say, “Hey tell me what alerts are going on.” Down the road we are going to provide that functionality so we need DataStax’s Solr integration.

 

Can you discuss the move from relational to NoSQL technology because people are always curious to hear about the learning curve and receive some tips?

I believe, first and foremost, there is no simple answer to your question because NoSQL is not NoSQL; it is a terrible term. Would you consider yourself even vaguely the same type of solution as Mongo?

 

No, not at all.

You see, the learning curve for NoSQL is more about actually how you build your data model as opposed to how you computationally interact with the system. Also I do not believe that NoSQL completely replaces the relational database.

 

But you don’t try to fit a relational database model into every datastore. It is a mindset concept. It is not a question of difficulty; it is a question of rethinking the problem, and that’s what people need to learn more than they need to learn about new technology. The new technology isn’t scary. It is about re-approaching things so you don’t just dump a relational database onto a NoSQL one, because its very nature is distributed. That whole concept of distributed data models versus highly schematized and localized data models is the biggest shift.

 

Thanks again for sharing your use case with us. Any final thoughts?

Basically, the main one that I would relate is that NoSQL does not mean NoSQL. Figure out what your use cases are before you do your technology investigation because, for example, a document store is very different from a key-value store. Secondly, for people looking especially at high-performance, high-velocity types of systems, I’d say look at Cassandra’s linear progression of scale, which is rarely if ever seen in the industry, as you know. Fully linear scaling of processing power, without efficiency loss as you add machines, is unheard of; but your solution has this colossal vertical scale curve that I’ve never ever seen from a data growth perspective.

 
