May 13th, 2013

Titan: Distributed Graph Database

Pearson is striving to accomplish the ambitious goal of providing an education to anyone, anywhere on the planet. New data processing technologies and theories in education are moving much of the learning experience into the digital space, into massive open online courses (MOOCs). Two years ago Pearson contacted Aurelius about applying graph theory and network science to this burgeoning space. A prototype proved promising in that it added novel, automated intelligence to the online education experience. However, at the time, no scalable, open-source graph database technology existed on the market. It was then that Titan was forged in order to meet the requirement of representing all universities, students, their resources, courses, etc. within a single, unified graph. Moreover, beyond representation, the graph needed to support sub-second, complex graph traversals (i.e. queries) while sustaining at least 1 billion transactions a day. Pearson asked Aurelius a simple question: “Can Titan be used to educate the planet?” This post is Aurelius’ answer.

Data Loading Benchmark

Pearson Education

Pearson provides free online education through its OpenClass platform. OpenClass is currently in beta with adoption by ~7,000 institutions. To meet the expected growth beyond beta, it is necessary to build the platform on a scalable database system. Moreover, it is important to build the platform on a database system that can support advanced algorithms beyond simple get/put semantics. The latter was demonstrated via the initial prototype. To address the former, a simulation of a worldwide education environment was created in Titan to alleviate Pearson’s scalability concerns.

Number of students               3.47 billion
Number of teachers               183 million
Number of courses                788 million
Number of universities           1.2 million
Number of concepts               9 thousand
Number of educational artifacts  ~1.5 billion
Total number of vertices         6.24 billion
Total number of edges            121 billion

The simulated world was a graph containing 6.24 billion vertices and 121 billion edges. The edges represent students enrolled in courses, people discussing content, content referencing concepts, teachers teaching courses, material contained in activity streams, universities offering courses, and so forth. Various techniques were leveraged to ensure that the generated data was consistent with a real-world instance. For example, people names were generated from sampling the cross product of the first and last names in the US Census Bureau dataset. Gaussian distributions were applied to determine how many courses a student should be enrolled in (mean of 8) and how many courses a teacher should teach (mean of 4). The course names and descriptions were drawn from the raw MIT OpenCourseWare data dumps. Course names were appended with tokens such as “101,” “1B,” “Advanced,” etc. in order to increase the diversity of the offerings. Student comments in discussions were sampled snippets of text from the electronic books provided by Project Gutenberg. University names were generated from publicly available city name and location datasets in the CommonHubData project. Finally, concepts were linked to materials using OpenCalais. The final raw education dataset was 10 terabytes in size.
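Two of the generation techniques above can be sketched in a few lines of Python. This is purely illustrative: the name lists below stand in for the Census Bureau data, and the function names are invented for the example.

```python
import random

# Illustrative sketch (not the actual generator) of two techniques from
# the simulation: sampling person names from the cross product of
# first/last names, and drawing per-student course counts from a
# Gaussian distribution with a mean of 8.

FIRST_NAMES = ["James", "Maria", "Wei", "Aisha", "Carlos"]  # stand-in for Census data
LAST_NAMES = ["Smith", "Garcia", "Chen", "Okafor", "Silva"]

def random_person_name(rng):
    """Sample one name from the first-name x last-name cross product."""
    return rng.choice(FIRST_NAMES) + " " + rng.choice(LAST_NAMES)

def courses_for_student(rng, mean=8.0, stddev=2.0):
    """Gaussian-distributed enrollment count, clamped to at least 1 course."""
    return max(1, round(rng.gauss(mean, stddev)))

rng = random.Random(42)
print(random_person_name(rng))   # a "First Last" string
print(courses_for_student(rng))  # an integer clustered around 8
```

The same pattern (a real corpus plus a randomizing transform) applies to the course names, comments, and university names described above.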

A 121 billion edge graph is too large to fit within the confines of a single machine. Fortunately, Titan/Cassandra is a distributed graph database able to represent a graph across a multi-machine cluster. The Amazon EC2 cluster utilized for the simulation was composed of 16 hi1.4xlarge machines. The specification of the machines is itemized below.

  • 60.5 GiB of memory
  • 35 EC2 Compute Units (16 virtual cores and 64-bit)
  • 2 SSD-based volumes each with 1024 GB of instance storage
  • I/O Performance: Very High (10 Gigabit Ethernet)

Educating the Planet - Disk Writes

The 10 terabyte, 121 billion edge graph was loaded into the cluster in 1.48 days at a rate of approximately 1.2 million edges a second with 0 failed transactions. These numbers were possible due to new developments in Titan 0.3.0 whereby graph partitioning is achieved using a domain-based byte order partitioner. With respect to education, the dataset maintains a structure where most edges are intra-university rather than inter-university (i.e. students and teachers typically interact with others at their own university, and even more so within their own courses). As such, domain partitioning is possible where vertices within the same university (i.e. graph community) are more likely to be co-located on the same physical machine.
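The co-location idea can be illustrated with a toy id scheme. The bit layout and function names below are invented for illustration (Titan 0.3.0's actual id assignment is more involved), but the principle is the same: packing the university (the domain) into the high-order bits of a vertex id means that a byte-order partitioner, which splits the key space into contiguous ranges, keeps each university's vertices together on one machine.

```python
# Toy sketch of domain-based partitioning under a byte-order partitioner.
# Assumption: 64-bit vertex ids with the university in the top 20 bits.

def vertex_id(university_id: int, local_id: int) -> int:
    """Pack the university (domain) into the high bits of a 64-bit id."""
    assert 0 <= university_id < 2**20 and 0 <= local_id < 2**44
    return (university_id << 44) | local_id

def partition_for(vid: int) -> int:
    """A byte-order partitioner ranges over the most significant bytes,
    so the top 4 bits pick one of 16 contiguous token ranges."""
    return vid >> 60

# Two vertices from the same university land in the same token range:
a = vertex_id(university_id=7, local_id=1)
b = vertex_id(university_id=7, local_id=999_999)
print(partition_for(a) == partition_for(b))  # True
```

Note how the 16 evenly spaced byte tokens in the nodetool output below (top nibble 0 through f) correspond to exactly this kind of range split.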

Once in Titan, the raw 10 terabyte dataset shrank to 5 terabytes due to data-agnostic Snappy compression and the use of Titan-specific graph compression techniques (e.g. “id deltas,” type definitions, and Kryo serialization). After the data was loaded, it was immediately backed up to Amazon S3. This process took 1.4 hours using Titan’s parallel backup strategy. The nodetool statistics for the Titan/Cassandra cluster are provided below.
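The "id deltas" technique deserves a brief aside. The sketch below is not Titan's on-disk format, only a minimal demonstration of the idea: a sorted adjacency list stores the (small) gaps between consecutive neighbor ids, variable-length encoded, instead of full 8-byte ids.

```python
# Minimal sketch of delta-plus-varint encoding of a sorted adjacency
# list, illustrating why "id deltas" shrink the data. Not Titan's
# actual serialization format.

def varint_encode(n: int) -> bytes:
    """Unsigned LEB128-style variable-length encoding."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_adjacency(neighbor_ids):
    """Encode sorted neighbor ids as varint gaps between consecutive ids."""
    encoded = bytearray()
    prev = 0
    for vid in sorted(neighbor_ids):
        encoded += varint_encode(vid - prev)  # store the gap, not the id
        prev = vid
    return bytes(encoded)

neighbors = [5_000_000_000 + i * 3 for i in range(1000)]  # 1000 nearby 64-bit ids
raw_size = 8 * len(neighbors)  # fixed 8-byte ids
delta_size = len(encode_adjacency(neighbors))
print(raw_size, delta_size)  # 8000 1004
```

Small gaps fit in one byte instead of eight, which is exactly the situation a domain-partitioned graph produces: a vertex's neighbors tend to have nearby ids.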

Address   Rack    Status   State    Load        Token
          rack1   Up       Normal   329.44 GB   Token(bytes[c000000000000000])
          rack1   Up       Normal   348.62 GB   Token(bytes[3000000000000000])
          rack1   Up       Normal   330.86 GB   Token(bytes[b000000000000000])
          rack1   Up       Normal   333.57 GB   Token(bytes[a000000000000000])
          rack1   Up       Normal   330.91 GB   Token(bytes[9000000000000000])
          rack1   Up       Normal   326.57 GB   Token(bytes[f000000000000000])
          rack1   Up       Normal   355.26 GB   Token(bytes[4000000000000000])
          rack1   Up       Normal   325.73 GB   Token(bytes[e000000000000000])
          rack1   Up       Normal   351.47 GB   Token(bytes[1000000000000000])
          rack1   Up       Normal   332.87 GB   Token(bytes[d000000000000000])
          rack1   Up       Normal   351.81 GB   Token(bytes[2000000000000000])
          rack1   Up       Normal   331.56 GB   Token(bytes[8000000000000000])
          rack1   Up       Normal   327.55 GB   Token(bytes[0000000000000000])
          rack1   Up       Normal   345.2 GB    Token(bytes[5000000000000000])
          rack1   Up       Normal   351.26 GB   Token(bytes[6000000000000000])
          rack1   Up       Normal   338.07 GB   Token(bytes[7000000000000000])

Transactional Benchmark

Titan Cluster and Application Servers

The purpose of the second half of the experiment was to subject the 121 billion edge education graph to numerous concurrent transactions. These transactions simulate users interacting with the graph: solving educational problems, adding more content, discussing ideas with one another, etc. To put the 16 hi1.4xlarge cluster under heavy load, 80 m1.medium machines were spawned. These 80 machines simulate the application servers that query the graph and provide users the front-end experience. Each machine maintained 30 threads in a “while(true)-loop,” randomly selecting 1 of the 16 transactional templates below and executing it. Thus, 2,400 concurrent threads communicated with the 16 hi1.4xlarge Titan cluster. A review of the Gremlin queries and their mean runtimes with standard deviations is presented in the table below. Note that many of the transactions are “complex” in that they embody a series of actions taken by a user; in a real-world setting, these would typically be broken up into smaller behavioral units (i.e. multiple individual transactions).
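A rough re-creation of one application server's driver loop might look as follows. This is a hypothetical sketch: the template bodies are stubs (the real templates are Gremlin traversals against the Titan cluster), only the seven template names from the table below are listed, and the loop is bounded here so the example terminates rather than running "while(true)".

```python
import random
import threading

# Hypothetical sketch of the load driver: 30 threads per application
# server, each repeatedly picking a transactional template at random
# and executing it. Template names are taken from the results table;
# the bodies are stubs.

TEMPLATES = ["scommentread", "reccourse", "scommentshare",
             "scontent", "saddfollow", "scourses", "classmates"]

counts = {name: 0 for name in TEMPLATES}
lock = threading.Lock()

def execute(template: str) -> None:
    pass  # stub: issue the corresponding Gremlin traversal here

def worker(seed: int, iterations: int = 1000) -> None:
    rng = random.Random(seed)
    for _ in range(iterations):  # stands in for the benchmark's while(true) loop
        template = rng.choice(TEMPLATES)
        execute(template)
        with lock:
            counts[template] += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(30)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(counts.values()))  # 30000
```

Scaling this pattern to 80 machines yields the 2,400 concurrent threads described above.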

name             # of tx    avg (ms)   std dev   description
scommentread     25550909   211.07     45.56     student reads the most recent comments for their courses
reccourse        5149469    467.37     178.20    student gets recommended courses to take
scommentshare    12825909   394.15     57.98     student reads comments in courses and shares a comment
scontent         20567687   279.32     81.83     student retrieves all content for a single course in their course list
saddfollow       12826912   193.72     22.77     student follows another student
scourses         7720965    233.38     79.44     student retrieves a list of all their courses with description
classmates       12849769   96.962     22.27     student retrieves a list of all their classmates