ZoomInfo is looking to map out the corporate world. Our mission is corporate data: we want to understand the companies that are out there, we want to know what they’re doing, we want to know who’s working there and how to get in touch with them.
So for example, for company data we want to know the company’s name, URL, address, industry, and how many employees it has; we want to know who works there, what their job titles are, and what their work contact info is.
We’ve been using Cassandra in a number of ways for three years now; we started on version 0.7.6. We do a lot of crawling on the web to gather much of our data, and every time we crawl a website, we cache a local copy of the text. You can imagine that over thirteen years of crawling websites, that gets to be a pretty big mass of data.
We used to store all of this in a proprietary database, but about three years ago that thing was really creaking and showing its age; it wasn’t really meant for the scale we were working at. So our first venture into Cassandra was migrating all of that cached web data from our proprietary database into Cassandra. That is currently a four-node cluster, going to six nodes, with roughly 25 terabytes.
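To make the migration concrete, here is a minimal sketch of how a crawl cache like the one described above could be modeled in CQL. This is purely illustrative: the table and column names are assumptions, not ZoomInfo’s actual schema. Partitioning by site keeps each page fetch a single point read, which matches the random-access pattern discussed later.

```sql
-- Hypothetical sketch, not the actual ZoomInfo schema:
-- cached page text keyed by site, so every read is a point lookup.
CREATE TABLE crawl_cache (
    site_domain text,       -- partition key: which site this page came from
    page_url    text,       -- clustering key: page within the site
    crawled_at  timestamp,  -- when the copy was taken
    page_text   text,       -- the cached local copy of the page text
    PRIMARY KEY (site_domain, page_url)
);

-- Retrieving one cached page is a random-access read by key:
SELECT page_text FROM crawl_cache
 WHERE site_domain = 'example.com' AND page_url = '/about';
```

A compound primary key like this spreads sites across the cluster while keeping all pages for one site together, which is one common way to lay out this kind of cache in Cassandra.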
Behind that we have our data systems, which hold our company and person data. We migrated each of those over to Cassandra from other storage platforms as well. We did the company one first: we evaluated some other big data stores at the time, but based on our usage profile, Cassandra made the most sense.
We looked at HBase and we also looked at MongoDB. MongoDB didn’t work for us because, at least at the time, it seemed that it was only atomic basically at the document level. They had a size limit of 25 or 50 megabytes, but at any rate it was telling us it could only be atomic at that level, and some of our data sets are bigger than that, so that really wasn’t going to work for us. One of the reasons we were migrating from our previous store was that it wasn’t perfectly atomic. HBase didn’t work for us either. It’s a great system, but our usage profile is very much random access: we call up our data based on what new data we find, and that follows a pretty random pattern. HBase, which performs better over sweeps of data, really wasn’t going to fit our more random-access usage pattern.
We run Cassandra on our own servers in a data center, actually. We did the company data system with Cassandra first and we more recently did the person data system, and that was going to be the larger undertaking because we have an order of magnitude more data about people than we do about companies.
When it came time to choose our hardware and order it, we got the IT team in the room, looked at the DataStax recommendations for hardware, talked over what made the most sense, and did some math about our data volume. From there they ordered machines that were going to work nicely for us.
We were involved a bunch more earlier on in our adoption. We attended quite a few meetings of the Boston Cassandra users group, and we actually even hosted a meeting, I want to say in 2012, and we had a good experience dealing with other people. But I almost feel like we’ve sort of settled in; we understand Cassandra pretty well here.
We recently upgraded a node to 1.2.8, and getting into some of the CQL stuff around there, I was able to dig around the DataStax blog and the Planet Cassandra blog and find some info that was really helpful. Reporting back bugs or issues I’ve found, and getting feedback on those, has been really helpful as well.
At ZoomInfo, we are continuously looking for talented software engineers that are interested in solving challenging and interesting Big Data problems. All of our current job listings can be found at: http://www.zoominfo.com/business/about/careers