Whitepages is the largest and most authoritative source of contact information about people and businesses in the United States, and we plan to expand rapidly into global markets as well.
We have a free website that gets about 50-60 million unique visitors a month, a number of pretty popular mobile applications that do caller ID and search, and a B2B offering called Whitepages Pro, where we sell data directly to other companies.
Our data set covers about 300 million people in North America and pretty much every landline. Mobile numbers are increasingly hard to cover, so that's a big challenge for us, but we are acquiring numbers at about 50-60 million a year and trying to ramp that up by an order of magnitude this year.
Every search done at Whitepages hits this data set, often retrieving a hundred or a few hundred business objects just to return one search page. We are currently using a Postgres instance to hold most of our materialized people data, and scaling that has been expensive: it's just a big Postgres box with a lot of memory on it.
We actually put the entire database in memory, which means really expensive, big boxes; that's a big reason why we looked at Titan and Cassandra going forward. As our hardware needs expanded, Postgres just wasn't going to cut it anymore.
We’ll still have Postgres around at Whitepages for use cases where scaling isn’t as important. For our actual people data, what you think of as Whitepages data, that’s all going into Titan at this point and, in some ways, Solr, too.
Historically we have built our data through a large batch-based process: we get a digest of huge feeds of data from other people on a monthly, sometimes quarterly, basis, then somebody has to cook that through a bunch of large MapReduce jobs and basically publish a new view of the world.
We've wanted to migrate because we increasingly have a lot of touch points where we're learning new information about the world on a continuous basis, and we gather those events. We've been building a continuous reasoning pipeline around that data, so that if we observe a new fact about the world, say that this name, this phone number, and this address are associated in a certain way, we can calculate the confidence that the fact is something we deem as true, and if it's above a certain threshold, publish it to our search engine.
This would let us do that in essentially real time, or at least with very low latency: when we learn new information, it becomes searchable on the website. To do that, we have to be able to scale writes to that main store. Postgres was this large in-memory database, and it would have been even more expensive to scale it out for writes. There are all sorts of ways to do it, but a lot of those ways involve building surrogate systems on top of inherently distributed storage, such as Cassandra.
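The confidence-threshold step of a pipeline like this can be sketched in a few lines. Everything here is an illustrative assumption, not Whitepages' actual system: the `Fact` fields, the toy scoring weights, and the 0.8 cutoff are all made up for the example.

```python
# Hypothetical sketch of a continuous reasoning pipeline's publish decision:
# score an observed fact, and publish it to the search index only if the
# confidence clears a threshold. All names and weights are assumptions.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for deeming a fact "true"

@dataclass
class Fact:
    name: str
    phone: str
    address: str
    sources: int       # how many independent feeds reported this association
    recency_days: int  # days since the association was last observed

def confidence(fact: Fact) -> float:
    """Toy scoring: more sources and fresher observations raise confidence."""
    source_score = min(fact.sources / 3.0, 1.0)
    recency_score = max(0.0, 1.0 - fact.recency_days / 365.0)
    return 0.6 * source_score + 0.4 * recency_score

def maybe_publish(fact: Fact, publish) -> bool:
    """Publish the fact (e.g. to a search engine) only if it clears the bar."""
    if confidence(fact) >= CONFIDENCE_THRESHOLD:
        publish(fact)
        return True
    return False
```

In a streaming setting, `maybe_publish` would be called for each incoming observation, which is what makes new facts searchable with low latency rather than waiting for a monthly batch rebuild.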
The other motivation was data modeling and data representation; that's where Titan comes in. We had a lot of problems trying to be agile in the type of data that we return: how many people, and how they link to locations and phones.
We've been saying for a long time that our data is inherently graph-like, so we finally put our money where our mouth is and decided to invest in making a graph database work. It's really been helpful in designing the shapes of data we return to our customers, and we can iterate much faster now.
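To make "inherently graph-like" concrete, here is a toy in-memory sketch of the shape: people, phones, and locations as vertices, associations as labeled edges. In production this data lives in Titan and is queried via Gremlin; the `Graph` class and vertex naming below are purely illustrative.

```python
# Toy graph of contact data: vertices are people, phones, and addresses;
# edges are labeled associations. A sketch only, not the Titan data model.
from collections import defaultdict

class Graph:
    def __init__(self):
        self.edges = defaultdict(set)  # vertex -> set of (label, vertex)

    def link(self, a, label, b):
        # Store both directions so traversals work from either endpoint.
        self.edges[a].add((label, b))
        self.edges[b].add((label, a))

    def neighbors(self, v, label):
        return {other for (lbl, other) in self.edges[v] if lbl == label}

g = Graph()
g.link("person:ada", "has_phone", "phone:555-0100")
g.link("person:ada", "lives_at", "addr:1 Main St")
g.link("person:bob", "has_phone", "phone:555-0100")  # shared line

# Reverse caller-ID style traversal: who is associated with this number?
print(g.neighbors("phone:555-0100", "has_phone"))  # both Ada and Bob
```

The point of the graph model is that a question like "who shares this phone?" is a one-hop traversal rather than a join across several relational tables, which is what makes iterating on the returned data shapes faster.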
We had a goal of very low latency, so this cluster was sized with enough memory across the cluster to keep everything cached in memory. We have on the order of 20-plus decent-sized SSD boxes, all in memory, with every instance running both Cassandra and Rexster, an HTTP endpoint for Titan.
We evenly distribute queries across the entire cluster. We actually do use some token awareness on it so we aren’t just randomly going everywhere. We are already getting near capacity on those servers. We might need to scale up soon. But that’s just Big Data.
There are still some challenges, but we think Cassandra and Titan together are a tremendous enabling technology for us: they let us rapidly grow our data set and increase the velocity with which we use, move, and publish data.
Multi-data center is on the roadmap for the future; part of that desire comes from our global expansion, since moving data closer to end users gives them a better experience. In addition, we would gain the resiliency of not being subject to an outage that affects a whole data center or region. Whether that happens this year or next year, who knows, but it will happen.