
How Whitepages turned the phone book into a graph

December 3, 2014

“How Whitepages turned the phone book into a graph” was written by Jean Villedieu, Founder of Linkurio.us.

Whitepages is now offering developers access to a Graph API. This is the final step in a complete overhaul of the way Whitepages structures its data. Here is the story of how Whitepages switched to graph technologies and transformed its business.

Phone books and the failures of RDBMS

If you were born in the 1990s or earlier, you are familiar with phone books. These books listed the phone numbers of the people living in a given area. When you wanted to contact someone whose name you knew, the phone book could help you find their number. Before people switched phones regularly and stopped caring about having a landline, this was important.

Phone book: a prankster’s best friend.

Needless to say, the phone book was a limited tool. It was impossible to use it to find who a certain number belonged to, for example. The data was there but hard to extract unless you were Rain Man.

In fact, until 2008 Whitepages was still using the digital equivalent of phone books to store its data. The Whitepages engineering team recently shared the story of its journey into graph databases. Until then, the company used multiple RDBMS silos (PostgreSQL 7.4, 8.0, MySQL, Oracle), each storing a large flat listing of data. Good old tables.

The paradox is that while Whitepages relied on RDBMSs, its data is naturally a graph. People and businesses are connected to addresses and phone numbers. Moreover, in the real world many people or businesses can be connected to the same address, and some phone numbers are shared. Of course this can be modeled with tables in a relational database, but it leads to serious issues.

The phone book addressed one use case: finding someone’s number given their name. The relational stack used by Whitepages could answer that type of query but struggled with more advanced questions like:

  • Who is behind a name? What are the present and historical addresses and phone numbers of a new customer?
  • What businesses exist in my area? Can I get a list of all the restaurant owners in Chicago, with addresses and phone numbers?
  • Who does this phone number belong to? Who is the owner, where do they live, and is it a telemarketer or a legitimate individual?

Whitepages customers needed answers to these questions to find new leads, identify potential fraudsters, or update customer listings. The relational technologies struggled with questions that required following connections in the data: in a table-based model, each hop through a shared address or phone number becomes another join, and queries slow down as the search goes deeper. According to ProgrammableWeb:

WhitePages started building their Contact Graph platform after they saw regular visitors to WhitePages.com coming from IP addresses associated with businesses. WhitePages found that business teams like call centers and fraud departments were relying on their web site. Call centers wanted to know how to quickly spell a caller’s name. Fraud departments checked to see if a customer name was really associated with a phone number and shipping address.

From 2008 to 2013, a team led by Devin Ben-Hur, Senior Architect at Whitepages, tried different solutions to this problem. Solr, HBase, Scala, and Kraken were all tried, but the problem remained: providing fast answers to customers looking for the connections within the Whitepages data.

Building the Contact Graph

Whitepages needed something that could meet its demanding requirements:

  • Scalable: distributed solution; just add nodes
  • Available: AP design; robust fault tolerance
  • High performance: more than 30,000 vertices/sec
  • High ingest rate: 200+ updates/sec

The system would have to support a dataset that is naturally connected, as well as queries centered on exploring the relationships between entities. Finally, it would have to be agile enough to adapt to new business and customer requirements.

The team led by Devin Ben-Hur settled on Titan. Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

Whitepages decided to use Titan together with Cassandra. It tested this architecture in three steps:

  • Local deployment (single-node cluster on a Mac)
  • Small cluster (5 nodes on AWS, 7.5-10 GB of data)
  • “Full” cluster (60 nodes on AWS, 3 TB of data)

The tests proved highly conclusive: at 400 requests per second, Titan delivered results in 47 ms, whereas the old system took 140 ms. Adding 200 simultaneous writes barely affected performance.

Whitepages is moving to Titan to handle over a billion entities. When the project is complete, the Whitepages graph will store the most comprehensive and accurate data for people and businesses in North America, including the best mobile data available anywhere.

Using a graph database like Titan allows Whitepages to use a more natural model to store its data.

Visualizing the Graph built by Whitepages

In a graph database, the entities are stored as individual nodes. Here we see a graph with two phone nodes, three person nodes, and two address nodes, linked together by relationships. For example, Jane Smith is linked to a current address and to a previous address.

Simply by looking at the graph visualization above, we can see that the Smith household is located in Seattle and is composed of two parents and (most probably) their daughter. It is simply a matter of following relationships. Querying the data (what is called a “graph traversal”) works the same way and is just as easy: finding someone’s phone number, neighbors, or spouse is a matter of looking for specific relationships within the graph.
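To make the idea concrete, here is a minimal sketch in plain Ruby (illustrative only, not Titan’s actual API) of entities stored as nodes, connections stored as labeled edges, and a traversal that follows them:

    # Entities become nodes; connections become labeled edges.
    nodes = {
      'person:jane'  => { type: 'Person',  name: 'Jane Smith' },
      'addr:seattle' => { type: 'Address', city: 'Seattle' },
      'phone:1'      => { type: 'Phone',   number: '206-555-0100' }
    }

    edges = [
      ['person:jane', 'CURRENT_ADDRESS', 'addr:seattle'],
      ['person:jane', 'HAS_PHONE',       'phone:1']
    ]

    # A "traversal" simply follows edges with a given label from a start node.
    def traverse(edges, from, label)
      edges.select { |src, lbl, _| src == from && lbl == label }.map(&:last)
    end

    traverse(edges, 'person:jane', 'HAS_PHONE')
      .each { |id| puts nodes[id][:number] }    # prints 206-555-0100

A real graph database stores the adjacency directly with each vertex, so this kind of lookup stays fast no matter how large the graph grows.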

The actual graph schema used by Whitepages is more sophisticated. It is designed to handle merged entities and out-of-order updates.

Whitepages Graph Database Schema

The impact of moving to a graph database is huge for Whitepages. It is not simply a new way to offer the same products and services with better performance. This is illustrated by the fact that the graph infrastructure used at Whitepages is now open to outsiders: Whitepages has released the WhitePages PRO API 2.0, an API that makes the Contact Graph available to anyone.

A new class of products and services

Whitepages is embracing an open philosophy and making its data available to developers everywhere. An overview of the WhitePages API is publicly available, and accessing the API only requires an email verification.

WhitePages is now serving an increasing number of customers via its API. These customers use the Contact Graph for:

  • fraud prevention;
  • lead qualification and contact data completion;
  • identity verification and normalization;
  • personalization and retargeting.

Using a flexible, easy-to-understand, high-performance graph back-end allowed Whitepages to turn its 17 years of experience in the data collection business into a platform used by companies like Amazon, Microsoft, and Twilio.

Whitepages is following other companies betting heavily on graphs. Last year, Facebook announced Graph Search, a search engine smart enough to understand the connections between Facebook’s users, places, tastes, etc. Crunchbase recently announced the Business Graph and is now using the Neo4j graph database to make its data about startups more easily available to the world.

Whitepages had a challenge: storing hundreds of millions of relationships between various entities and quickly finding connections in this dataset. Like more and more companies, it turned to a graph database to solve the problem. The results were good enough for Whitepages to give direct access to its data through a public API. It shows how graph technologies create new opportunities to find value in existing data.

Lars Fronius, Site Reliability Engineer at EyeEm
"Mainly we needed a database that would allow us to easily scale out the more data we add to it... we strive for having elastic data storages in the cloud that allow us to replace failing nodes easily, without affecting application uptime"
EyeEm is a photo community and the world’s premier marketplace for mobile photography.
 
I am the Site Reliability Engineer at EyeEm. I take care of performance and stability of a system with an ever-increasing user base.
 
Predictable scale
Mainly we needed a database that would allow us to easily scale out as we add more data to it. The more recent products we have built have higher data demands than we had before, and thus we needed a database that supports them. We also strive for elastic data stores in the cloud that allow us to replace failing nodes easily, without affecting application uptime. We had a really good time with this pattern on Elasticsearch and learned from it that this should be a requirement for any database we add in the future.
 
We looked at DynamoDB and Couchbase as well as Apache Cassandra. DynamoDB was, at that point, too inflexible schema-wise for our data structure, and Couchbase did not seem resilient enough. Cassandra has customers that have proven its stability, and the schemas it allows are quite flexible and fit our use cases.
Photo search

Cassandra supports applications of ours that are decoupled from our main application running on MySQL. For instance, our machine learning team (see TechCrunch’s EyeEm’s Algorithms Are Learning What Makes A Photo Great) needs to store big amounts of denormalised data that does not need to be accessed directly from our core community product. We also migrated the large amounts of data we use to make more sense of the photos in our photo search; that data is not directly coupled to or queried by our main community application, so it could easily be migrated off of it to keep the main database small and agile. The part of our application that indexes photos into our search can easily fetch the denormalised data it needs from Cassandra, and we have less of a headache scaling our main database.

An easy migration
We are running version 2.1.0 on two 5-node clusters at the moment. We are able to boot up clusters for staging through the infrastructure deployment tool we built around Amazon CloudFormation. Our data models so far are mostly composite-key tables, sometimes with a timestamp as an ordering key.
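As a hedged illustration of that pattern (the keyspace, table, and column names here are invented, not EyeEm’s actual schema), a composite-key table with a timestamp as the ordering key looks like this in CQL, issued here through the DataStax Ruby driver:

    require 'cassandra'

    session = Cassandra.cluster.connect   # connects to 127.0.0.1 by default

    session.execute(<<-CQL)
      CREATE KEYSPACE IF NOT EXISTS demo
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    CQL

    # photo_id is the partition key; taken_at is a clustering column, so the
    # rows inside one partition are stored and read back in timestamp order.
    session.execute(<<-CQL)
      CREATE TABLE IF NOT EXISTS demo.photo_events (
        photo_id bigint,
        taken_at timestamp,
        event    text,
        PRIMARY KEY (photo_id, taken_at)
      ) WITH CLUSTERING ORDER BY (taken_at DESC)
    CQL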

Cassandra gives us resilient data storage that allows operations to scale up and down easily based on usage. The CQL interface brought an easy migration path, since we did not have to relearn a query language, and Cassandra steers us away from schemas that would not scale on any database anyway.

PHP Cassandra library
Make sure to read the documentation, or maybe a book, on the technology behind Cassandra and how to model for it. Apart from that, Cassandra itself supports you pretty well when it comes to designing schemas.
 

There was a lot of documentation on various blogs, and it was easy to find some guidance on Twitter as well. That said, there wasn’t much support on the PHP side of things, which we needed for our main product.

 
So we came up with a solution ourselves; check out our PHP Cassandra library on GitHub.
Andrés Velasco Collado, Web Architect at BrainSINS
"We receive thousands of data points per second and we need to react to those inputs in real time and needed a robust data management system that is able to scale and process huge amounts of information. Cassandra helps us achieve that and the reliability we need for our infrastructure."

BrainSINS is a 360º personalization solution for eCommerce, currently working with hundreds of eCommerce websites such as Toys’R Us, Mothercare, and Caterpillar.

Our solution is designed to increase online sales and improve the online shopping experience. On average, we help our customers increase their sales by 20% using a combination of the products we offer. Our personalization suite includes personalized recommendations, an advanced cart-abandonment recovery system, an in-site behavioral targeting solution, and a set of gamification features.

From the lines above you can imagine the range of technologies we use to achieve our objectives; to work on this awesome team you need to be a full-stack developer. I work as a web architect, leading the definition and implementation of our web architecture, and also supporting our backend developers as a sysadmin for our NoSQL architecture.

Full personalization

To provide full personalization to our customers, we need to track every possible action the users perform in the online stores: visits to products, visits to any other page (categories, homepage, etc.), when the user adds products to the shopping cart, when the checkout process starts, etc.

Tracking user actions is not an easy task, but it can be managed with a relational database up to a point. Once you reach a certain limit, you can see performance degrade fast as hell. From that point on, you need to start thinking about what you can do to improve the overall performance of the system without increasing the investment in infrastructure.

Vertical scaling of the relational database was out of the question because it is too expensive and has its own limits. Horizontal scaling of a relational database is a “pain in the a..”, since you have to maintain the consistency of the data among all the servers in real time.

We receive thousands of data points per second and we need to react to those inputs in real time and needed a robust data management system that is able to scale and process huge amounts of information. Cassandra helps us achieve that and the reliability we need for our infrastructure.

Cassandra at BrainSINS

Cassandra helped us achieve two objectives:

  • From the business point of view, cheaper and faster machines that do the same work.
  • From the technical point of view, a cluster of machines with no single point of failure, synced and easy to scale.

We are currently using Cassandra v1.2.15 in a cluster of Amazon Linux machines on Amazon EC2. Cassandra supports our whole NoSQL infrastructure from the bottom up. The cluster currently goes from a minimum of 2 machines up to whatever the traffic requires; it is very flexible and allows us to boot up new machines and join them to the ring in less than two minutes.

When a user performs an action, it goes directly to an SQS queue, which feeds a group of machines that we call “workers”. These machines write the action performed by the user into Cassandra in a way that is easy for us to retrieve later (a sketch of such a worker follows the diagram below). We use Astyanax to read from and write to Cassandra, and we are using the Cassandra File System instead of HDFS.

Brainsins deployment
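BrainSINS’ workers are Java processes built on Astyanax, but the shape of the loop is easy to sketch. Here is a hypothetical Ruby version (the queue URL, region, keyspace, and table are invented, and it assumes a user_actions table keyed by user and time already exists) using the AWS SDK and a recent version of the DataStax Ruby driver:

    require 'aws-sdk'
    require 'cassandra'
    require 'json'

    sqs       = Aws::SQS::Client.new(region: 'eu-west-1')
    queue_url = 'https://sqs.eu-west-1.amazonaws.com/123456789012/user-actions'
    session   = Cassandra.cluster.connect('tracking')
    insert    = session.prepare(
      'INSERT INTO user_actions (user_id, occurred, action) VALUES (?, ?, ?)'
    )

    loop do
      # Long-poll the queue for up to 10 messages at a time.
      resp = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 10,
                                 wait_time_seconds: 20)
      resp.messages.each do |msg|
        event = JSON.parse(msg.body)
        # Write the action keyed by user and time, so it is easy to read back.
        session.execute(insert,
                        arguments: [event['user_id'], Time.now, event['action']])
        sqs.delete_message(queue_url: queue_url,
                           receipt_handle: msg.receipt_handle)
      end
    end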

Later, to generate analytics reports and data for our clients, we run Hadoop map-reduce jobs written in Pig to calculate aggregate data, which is written into a SQL database that feeds the web application our clients use to set up the different services. To write the Pig results into SQL we use Sqoop, and to automate this part we use Oozie. As you can see, we love the Apache Software Foundation.

Thanks to Cassandra’s flexibility, we have built a single machine setup with all these technologies that allows us to scale the cluster up and down as needed without touching any config file.

The Cassandra, the MongoDB, & the HBase

We needed to read a lot of “Getting started with NoSQL technologies” documentation in order to learn about the different solutions out there. We had four candidates: Cassandra, HBase, MongoDB, and DynamoDB (Amazon). First of all, we had to choose three of them to test:

  • The easiest to manage: DynamoDB.
  • The fastest: HBase.
  • The cheapest: Cassandra.
  • The coolest: MongoDB.

Since we already knew DynamoDB, because we were using it for some small processes, we divided our developers into three teams: Team HBase, Team Cassandra, and Team MongoDB. Each team had to build a development environment with the technology assigned to it, including failure and performance tests. HBase’s throughput was a bit higher than Cassandra’s, at least according to the results of our tests, but Cassandra’s benefits, such as its scalability, allowed it to surpass HBase.

HBase was removed from the list because of its single point of failure, and MongoDB because Cassandra’s throughput is better, so Cassandra and DynamoDB were the finalists. DynamoDB was easier to maintain, but it also came with more restrictions and a higher cost, so we finally decided to go with Cassandra.

Advice & the enthusiastic community

I am sure anyone trying out Cassandra has read about needing a minimum of 1 GB of RAM in your system; that’s true. When the first exceptions appear in the logs, think about the limits of your hardware. From the sysadmin point of view, the best part of Cassandra is that you can automate every aspect of it, though you may have to fight a little to achieve it.

Since it is a new technology, it has not always been easy to find solutions for specific problems; nevertheless, the community is enthusiastic and everybody wants to contribute in order to establish Cassandra as a standard in the NoSQL ecosystem.

Thank you for allowing us to share our experience. We encourage other teams around the world to do the same: more people using Cassandra means more people contributing knowledge to this outstanding technology.

Ruby Driver 1.0 GA release

November 19, 2014

I’m very happy to announce that the DataStax Ruby Driver 1.0 GA for Apache Cassandra and DataStax Enterprise has just been released. It has been an exciting journey, and this is only the beginning; please refer to the complete changelog for details.

Installation

You can install the driver now using RubyGems:
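    gem install cassandra-driver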

Or Bundler:
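    # In your Gemfile
    gem 'cassandra-driver'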

Quick start

Here is a quick look at using the driver:
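A minimal session against a local node (the driver connects to 127.0.0.1 by default; the query reads table names from the system keyspace of Cassandra 1.2/2.0):

    require 'cassandra'

    cluster = Cassandra.cluster          # connects to localhost by default
    session = cluster.connect('system')  # sessions can be scoped to a keyspace

    rows = session.execute(
      'SELECT keyspace_name, columnfamily_name FROM schema_columnfamilies'
    )
    rows.each do |row|
      puts "#{row['keyspace_name']} has a table called #{row['columnfamily_name']}"
    end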

Features

The DataStax Ruby Driver 1.0 for Apache Cassandra and DataStax Enterprise includes the following features:

Compatibility

This driver works exclusively with the Cassandra Query Language v3 (CQL3) and Cassandra’s native protocol, and supports the following software versions:

  • Apache Cassandra 1.2 and 2.0
  • DataStax Enterprise 3.1, 3.2, 4.0 and 4.5
  • Ruby (MRI) 1.9.3, 2.0 and 2.1
  • JRuby 1.7
  • Rubinius 2.2

Useful links

Finally, celebrate this release with a screencast detailing load balancing in the Ruby Driver.
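The load balancing policy can also be set explicitly when building the cluster; for example, the datacenter-aware policy keeps requests in the local datacenter (the datacenter name below is an assumption):

    require 'cassandra'

    # Prefer hosts in the local datacenter, ignoring remote ones by default.
    policy  = Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new('dc1')
    cluster = Cassandra.cluster(load_balancing_policy: policy)
    session = cluster.connect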

Happy Coding!

Owen Kim, Lead Software Engineer at PagerDuty
"We needed to be fault-tolerant to catastrophic regional failures. Cassandra's tunable replication and consistency let us define and implement these policies and be fault-tolerant in precisely how we need to be."

PagerDuty is the central hub for on-call and operations dispatch. At its core, it ties together all your monitoring services into one place; manages your on-call schedules, escalation policies, and notification methods; and ensures that if something is wrong in your service, the right person gets alerted so they can act quickly to resolve any issues. I personally work on the pipeline of alerts that starts with our monitoring integrations or HTTP and email APIs and ends with a person getting a call, SMS, email, or push notification.

Available alerts

Our business is to alert people when they’re having problems, so we need a high standard for uptime of our integrations and deliverability of alerts; there’s little value in an alert service that only works sometimes. We needed to be fault-tolerant to catastrophic regional failures, and Cassandra’s tunable replication and consistency let us define and implement these policies and be fault-tolerant in precisely the way we need to be.

Building on stable ground

Cassandra is essentially the platform that we built our alert pipeline on. This pipeline is broken into multiple stages and services, but each is backed by Cassandra so that we can build on the reliability it provides.

Stability is absolutely the top benefit we receive from Cassandra. It’s hard to build a stable service if the bottom of the stack isn’t stable. Cassandra functions as a solid base for our applications.

We evaluated Cassandra 3-4 years ago and found it to be the most mature option and the most suitable for cross-DC deployments. We’re now using several Cassandra 1.2 clusters spanning 3 data center regions; each cluster is 5-10 nodes with a 2-2-1 replication factor.
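As a hedged sketch of what such a 2-2-1 layout looks like in CQL (the keyspace, table, and datacenter names here are invented, not PagerDuty’s actual schema), issued through the DataStax Ruby driver:

    require 'cassandra'

    session = Cassandra.cluster.connect

    # Two replicas in each of two regions and one in a third: a 2-2-1 layout.
    session.execute(<<-CQL)
      CREATE KEYSPACE IF NOT EXISTS alerts
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_west': 2, 'us_east': 2, 'eu_west': 1
      }
    CQL

    session.execute(
      'CREATE TABLE IF NOT EXISTS alerts.events (id timeuuid PRIMARY KEY, body text)'
    )

    # Consistency is tuned per request: QUORUM here means 3 of the 5 replicas,
    # so a write can still succeed even if an entire region is unreachable.
    session.execute("INSERT INTO alerts.events (id, body) VALUES (now(), 'page')",
                    consistency: :quorum)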

Tips and tricks

Background anti-entropy and load issues can creep up very suddenly and without warning if you’re not looking out for the right signals. Stay ahead of depleting capacity by scaling in advance, a task that’s relatively easy to do with Cassandra.

I went to the Cassandra Summit in San Francisco this year and was really impressed by many of the speakers. A lot of them were very candid about their experiences and lessons learned, which was really useful for me.

Apache Cassandra at Pager Duty: Watching Your Cassandra Cluster Melt

