
Tracking Millions of Heartbeats on ZEE’s Streaming Platform

How strategic database migration and data (re)modeling improved latencies and cut database costs 5X

ZEE is India's largest media and entertainment business, covering broadcast TV, films, streaming media, and music. ZEE5 is its premier OTT streaming service, available in over 190 countries with ~150M monthly active users. Every user's playback experience, security, and recommendations rely on a "heartbeat API" that processes a whopping 100B+ heartbeats per day.

The engineers behind the system knew that continued business growth would stress their infrastructure (as well as the people reviewing the database bills). So the team decided to rethink the system before it inflicted any heart attacks. TL;DR: they designed a system that's loved internally and by users. Jivesh Threja (Tech Lead) and Srinivas Shanmugam (Principal Architect) joined us on Valentine's Day last year to share their experiences. They outlined the technical requirements for the replacement (cloud neutrality, multi-tenant readiness, simple onboarding of new use cases, and high throughput with low latency at optimal cost) and how those led them to ScyllaDB. Then they explained how they achieved their goals through a new stream processing pipeline, a new API layer, and data (re)modeling. The initial results of their optimization: 5X cost savings (from $744K to $144K annually) and single-digit millisecond P99 read latency. Wrapping up, they shared lessons learned that could benefit anyone considering or using ScyllaDB. Here are some highlights from that talk.

What's a Heartbeat?

A "heartbeat" is a request fired at regular intervals during video playback on the ZEE5 OTT platform. These simple requests track what users are watching and how far they've progressed in each video. They're essential for ZEE5's "continue watching" functionality, which lets users pause content on one device and resume it on any other device.
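To make that concrete, here is a minimal sketch of what such a heartbeat event might carry. The field names and payload shape are illustrative assumptions, not ZEE5's actual wire format:

```python
import json
import time

def make_heartbeat(user_id: str, device_id: str, asset_id: str,
                   position_seconds: int) -> str:
    """Build a hypothetical heartbeat payload (field names are illustrative)."""
    return json.dumps({
        "user_id": user_id,            # who is watching
        "device_id": device_id,        # which device, so playback can resume elsewhere
        "asset_id": asset_id,          # what they are watching
        "position_seconds": position_seconds,  # progress, for "continue watching"
        "sent_at": int(time.time()),   # when this heartbeat fired
    })

# A client would fire one of these at each heartbeat interval during playback:
payload = make_heartbeat("user-1", "tv-livingroom", "movie-42", 1347)
```

The key point is that each event carries both identity (user, device, asset) and progress, which is all the "continue watching" and viewership-metric use cases below need.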
They're also instrumental for calculating key metrics, such as concurrent viewership for a big event or the top shows of the week.

Why Change?

ZEE5's original heartbeat system was a web of different databases, each handling a specific part of the streaming experience. Although it was technically functional, this approach was expensive and locked the team into a specific vendor ecosystem. They recognized an opportunity to streamline their infrastructure, and they went for it.

They wanted a system that wasn't locked into any particular cloud provider, would cost less to operate, and could handle their massive scale with consistently fast performance: specifically, single-digit millisecond responses. They also wanted the flexibility to add new features easily and the ability to offer the system to other streaming platforms. As Srinivas put it: "It needed to be multi-tenant ready so it could be reused for any OTT provider. And it needed to be easily extensible to new use cases without major architectural changes."

System Architecture, Before and After

The original architecture spanned multiple databases:

- DynamoDB to store the basic heartbeat data
- Amazon RDS to store next- and previous-episode information
- Apache Solr to store persistent metadata
- One Redis instance to cache metadata
- Another Redis instance to store viewership details

The ZEE5 team considered four main database options for this project: Redis, Cassandra, Apache Ignite, and ScyllaDB. After evaluation and benchmarking, they chose ScyllaDB. Among the reasons Srinivas cited for this decision: "We don't need an extra cache layer on top of the persistent database. ScyllaDB manages both the cache layer and the persistent database within the same infrastructure, ensuring low latency across regions, replication, and multi-cloud readiness.
It works with any cloud vendor, including Azure, AWS, and GCP, and now offers managed support with a turnaround time of less than one hour."

The new architecture simplifies and flattens the previous structure. All heartbeat events are now pushed into a heartbeat topic, processed through stream processing, and ingested into ScyllaDB Cloud via ScyllaDB connectors. Whenever content is published, it's ingested into a metadata topic and then inserted into ScyllaDB Cloud via metadata connectors. Srinivas concluded: "With this new architecture, we successfully migrated workloads from DynamoDB, RDS, Redis, and Solr to ScyllaDB. This has resulted in a 5X cost reduction, bringing our monthly expenses down from $62,000 to around $12,000."

Deeper into the Design

Next, Jivesh shared more about the low-level design.

Real-time stream processing pipeline

In the real-time stream processing pipeline, heartbeats are sent to ScyllaDB at regular intervals. The heartbeat interval is set to 60 seconds, meaning that every frontend client sends a heartbeat every 60 seconds while a user is watching a video. These heartbeats pass through the playback stream processing system, where business logic consumers transform the data into the required format; the processed data is then stored in ScyllaDB.

Scalable API layer

The first component in the scalable API layer is the heartbeat service, which handles large volumes of data ingestion. Topics process the data, which then passes through a connector service and is stored in ScyllaDB. Another notable API layer service is the Concurrent Viewership Count service, which uses ScyllaDB to retrieve concurrent viewership data, either per user or per asset (e.g., per asset ID). For example, when a movie is released, this service can tell how many users are watching it at any given moment.
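As a rough sketch of the consumer-side transform described above (reshaping a raw heartbeat into the row shape a viewership table expects, then answering a per-asset lookup), with hypothetical field names since the actual consumers and schemas are ZEE5-internal:

```python
def transform(raw_event: dict) -> dict:
    """Map a raw heartbeat event onto a viewership-style row.

    In the target table, user_id acts as the partition key and asset_id
    as the clustering key; the field names here are illustrative.
    """
    return {
        "user_id": raw_event["user_id"],
        "asset_id": raw_event["asset_id"],
        "device_id": raw_event.get("device_id", "unknown"),
        "position_seconds": raw_event["position_seconds"],
    }

def concurrent_viewers(rows: list[dict], asset_id: str) -> int:
    """Per-asset lookup: count distinct users with a live row for this asset."""
    return len({row["user_id"] for row in rows if row["asset_id"] == asset_id})

# Simulated batch of raw events from the heartbeat topic:
events = [
    {"user_id": "u1", "asset_id": "movie-42", "position_seconds": 10},
    {"user_id": "u2", "asset_id": "movie-42", "position_seconds": 95},
    {"user_id": "u1", "asset_id": "show-7", "position_seconds": 5},
]
rows = [transform(e) for e in events]
```

In the real pipeline the transformed rows are written to ScyllaDB by the connector service rather than counted in memory, but the row shape and the per-asset query pattern are the same idea.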
Metadata management use case

One of the first major challenges ZEE5 faced was managing metadata for their massive OTT platform. Initially, they relied on a combination of three different databases (Solr, Redis, and Postgres) to handle their extensive metadata needs. Looking to optimize and simplify, they redesigned the data model to work with ScyllaDB instead, using the content ID as the partition key, along with materialized views. Here's their metadata model:

```sql
CREATE TABLE keyspace.meta_data (
    id text,
    title text,
    show_id text,
    …,
    …,
    PRIMARY KEY ((id), show_id)
) WITH compaction = { 'class': 'LeveledCompactionStrategy' };
```

In this model, the ID serves as the partition key. Since this table sees relatively few writes (a write occurs only when a new asset is released) but far more reads, they used the Leveled Compaction Strategy to optimize performance. According to Jivesh, "Choosing the right partition and clustering keys helped us get a single-digit millisecond latency."

Viewership count use case

Viewership Count is another use case they moved to ScyllaDB. Viewership can be tracked per user or per asset ID. ZEE5 designed a table with the user ID as the partition key and the asset ID as the clustering key, allowing viewership data to be queried efficiently. They set the table's TTL to match the 60-second heartbeat interval, ensuring that data automatically expires after the designated time. Additionally, they used ScyllaDB's Time-Window Compaction Strategy to efficiently manage the data, clearing expired records based on the configured TTL. Jivesh explained: "This table is continuously updated with heartbeats from every front end and every user. As heartbeats arrive, viewership counts are tracked in real time and automatically cleared when the TTL expires.
That lets us efficiently retrieve live viewership data using ScyllaDB."

Here's their viewership count data model:

```sql
CREATE TABLE keyspace.USER_SESSION_STREAM (
    USER_ID text,
    DEVICE_ID text,
    ASSET_ID text,
    TITLE text,
    …,
    PRIMARY KEY ((USER_ID), ASSET_ID)
) WITH default_time_to_live = 60
  AND compaction = { 'class': 'TimeWindowCompactionStrategy' };
```

ScyllaDB Results and Lessons Learned

A load test conducted during the database selection process, to evaluate performance under high load, showed a throughput of 41.7K requests per second. Jivesh remarked: "Even with such a high throughput, we could achieve a microsecond write latency and average microsecond read latency. This really gave us a clear view of what ScyllaDB could do, and that helped us decide."

He then shared some facts that shed light on the scale of ZEE5's ScyllaDB deployment: "We have around 9TB on ScyllaDB. Even with such a large volume of data, it's able to handle latencies within microseconds and single-digit milliseconds, which is quite tremendous. We have a daily peak concurrent viewership count of 1 million. Every second, we are writing so much data into ScyllaDB and getting so much data out of it. We process more than 100 billion heartbeats in a day. That's quite huge."

The talk wrapped up with the following lessons learned:

- Data modeling is the single most critical factor in achieving single-digit millisecond latencies.
- Choose the right quorum setting and compaction strategy. For example, does a heartbeat need to be written to every node before it can be read, or is a local quorum sufficient? Selecting the right quorum ensures the best balance between latency and SLA requirements.
- Choose partition and clustering keys wisely; they're not easy to modify later.
- Use materialized views for faster lookups, and avoid filtering queries: querying across partitions can degrade performance.
- Use prepared statements to improve efficiency.
- Use asynchronous queries for faster query processing. For instance, in the metadata model, 20 asynchronous queries were executed in parallel, and ScyllaDB handled them within milliseconds.
- Zone-aware ScyllaDB clients help reduce cross-AZ (Availability Zone) network costs. Fetching data within the same AZ minimizes latency and significantly reduces network expenses.
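The TTL behavior behind the viewership table can be mimicked in a few lines. This is a toy in-memory model (not the driver API or ScyllaDB itself) of how rows written with default_time_to_live = 60 drop out of the concurrent count once heartbeats stop arriving:

```python
HEARTBEAT_TTL = 60  # seconds, matching the table's default_time_to_live

class ViewershipCounter:
    """Toy model of TTL-expiring viewership rows (illustrative only)."""

    def __init__(self) -> None:
        # (user_id, asset_id) -> time of the most recent heartbeat
        self._last_seen: dict[tuple[str, str], float] = {}

    def heartbeat(self, user_id: str, asset_id: str, now: float) -> None:
        """Record a heartbeat; a real write would refresh the row's TTL."""
        self._last_seen[(user_id, asset_id)] = now

    def concurrent(self, asset_id: str, now: float) -> int:
        """Count viewers whose last heartbeat is still within the TTL."""
        return sum(1 for (u, a), ts in self._last_seen.items()
                   if a == asset_id and now - ts < HEARTBEAT_TTL)

counter = ViewershipCounter()
counter.heartbeat("u1", "movie-42", now=0)
counter.heartbeat("u2", "movie-42", now=0)
```

With both heartbeats at t=0, `counter.concurrent("movie-42", now=30)` reports 2; by t=61 both rows have "expired" and the count falls to 0. Letting the database expire rows this way is what allows live viewership to be read straight from the table with no cleanup job.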