Andres Rangel, Senior Software Engineer at Hulu
Hulu is an online video service that offers a selection of hit TV shows, clips, movies and more on the free, ad-supported Hulu.com service, and the subscription service Hulu Plus. One of the top video streaming sites in the U.S., today they have over 5 million subscribers and approximately 30 million unique viewers per month (http://blog.hulu.com/2013/12/18/a-strong-2013/).
I am a senior software engineer on the core services team; we build scalable, highly available systems to support the website and mobile devices. Matt Jurik, who is in the same group, is the engineer in charge of our Cassandra cluster.
Storing where you left off, in real-time
We currently use Apache Cassandra for several services here at Hulu. One of them stores subscriber watch history for real-time access by other internal services; we use Cassandra to handle persistence and multi-datacenter replication. All updates are written both to a caching tier and directly to Cassandra. This gives us great flexibility with our caching tier while keeping a reliable persistence layer to fall back on.
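The dual-write pattern described above can be sketched roughly as follows. This is only an illustration, not Hulu's implementation: the class name `WatchHistoryStore`, its methods, and the `(subscriber_id, video_id)` key are hypothetical, and plain dictionaries stand in for both the caching tier and Cassandra.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: one saved playback position per (subscriber, video).
Key = Tuple[str, str]


class WatchHistoryStore:
    """Writes go to both tiers; reads prefer the cache and fall back."""

    def __init__(self) -> None:
        self.cache: Dict[Key, int] = {}       # stand-in for the caching tier
        self.persistent: Dict[Key, int] = {}  # stand-in for Cassandra

    def record_position(self, subscriber_id: str, video_id: str,
                        position_seconds: int) -> None:
        key = (subscriber_id, video_id)
        # Every update is written to the persistent layer and to the cache.
        self.persistent[key] = position_seconds
        self.cache[key] = position_seconds

    def get_position(self, subscriber_id: str, video_id: str) -> Optional[int]:
        key = (subscriber_id, video_id)
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the persistent layer and repopulate.
        position = self.persistent.get(key)
        if position is not None:
            self.cache[key] = position
        return position

    def flush_cache(self) -> None:
        # The cache can be dropped or rebuilt without losing any data.
        self.cache.clear()
```

Because every write also reaches the persistent layer, the cache in this sketch can be flushed or replaced at any time and reads still succeed, which is the kind of flexibility the paragraph above refers to.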
Cassandra offers good performance, near-linear scalability for our data model, and geo-replication, all with minimal maintenance requirements. We evaluated HBase and Riak, but ultimately decided that Cassandra satisfied our needs best.
Running Cassandra at Hulu
Our primary cluster is running version 1.2.12 and consists of 16 nodes split between 2 datacenters. Our watch history keyspace contains several billion CQL3 rows with approximately 1TB of data per datacenter. The individual nodes are 12-core machines with 48GB RAM and multiple SSDs in a RAID5 configuration.
Words of wisdom
It’s important to analyze how you are going to query your data. Spending time designing your schema around your query pattern can save a lot of hassle debugging performance issues, while also ensuring that you can scale easily. Additionally, a high-level understanding of some of the internals, such as how deletions are implemented, how secondary indices operate, and when to use the row cache, can go a long way toward designing a strong application on top of Cassandra.
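As a rough illustration of designing around a query, suppose the query to serve is "the most recently watched videos for one subscriber". The query and the layout below are assumptions made for this example, not Hulu's actual schema. The sketch models, in plain Python, a table partitioned by subscriber and ordered by watch time descending, so the query reads the front of a single partition; `WatchHistoryTable` and its fields are hypothetical names.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical row: (negated watch timestamp, video_id, position_seconds).
Row = Tuple[int, str, int]


class WatchHistoryTable:
    """In-memory model: partition by subscriber, sort by watch time descending."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, subscriber_id: str, watched_at: int,
               video_id: str, position_seconds: int) -> None:
        # Negating the timestamp keeps each partition sorted newest-first.
        row = (-watched_at, video_id, position_seconds)
        bisect.insort(self._partitions[subscriber_id], row)

    def recent(self, subscriber_id: str, limit: int) -> List[Tuple[str, int]]:
        # The query touches one partition and reads a prefix of it,
        # with no scan across subscribers and no sort at read time.
        rows = self._partitions.get(subscriber_id, [])
        return [(video_id, pos) for _, video_id, pos in rows[:limit]]
```

A query the layout was not designed for, such as "every subscriber who watched a given video", would have to scan every partition in this model, which is the kind of problem that choosing the schema from the query pattern up front is meant to avoid.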
The community is fantastic. The #cassandra IRC channel is a lively bunch; folks are always willing to help out and offer advice.