Netflix’s Data Reprocessing Pipeline
Problem
Netflix’s Asset Management Platform (AMP) is a centralized service that organizes, stores, and discovers digital media assets created during movie production. Each asset goes through a cycle of schema validation, versioning, access control, sharing, and triggering of configured workflows. As the platform evolved, it began supporting not only studio applications but also data science and machine learning applications that discover asset metadata and build data facts on top of it.
This evolution brought a challenge, however. Netflix frequently received requests to update existing asset metadata or to add new metadata in support of new features, and this pattern grew over time, requiring repeated access to and updates of existing asset metadata. Moreover, the real-time APIs for asset metadata access, backed by the Cassandra database, were a poor fit for the analytics use cases of the data science and machine learning teams.
Solution
To address this, Netflix built a data pipeline that could extract existing asset metadata and process it specifically for each new use case. This framework allowed the application to evolve and adapt to unpredictable changes requested by platform clients, with production asset operations running in parallel with the reprocessing of older data, all without any service downtime.
How Cassandra is Used in the Solution
Cassandra is the primary data store of the asset management service. Netflix designed the data tables so that data could be read with pagination in a performant manner, despite the lack of a built-in pagination concept in NoSQL datastores like Cassandra. Asset data is read either by asset schema type or by a time bucket derived from the asset creation time. Sharding the data on asset type and creation-time bucket allowed it to be distributed efficiently across Cassandra nodes.
For example, to fetch a list of asset ids, they used the asset type and time buckets. The asset id was defined as a Cassandra timeuuid, a data type that sorts chronologically and can therefore serve as a pagination cursor. Based on the page size, the first N rows were fetched from the table; each subsequent page was fetched with LIMIT N and the condition asset id < last asset id fetched.
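A minimal sketch of this read pattern, using the DataStax Python driver, is shown below. The keyspace, table, and column names are assumptions for illustration only; the actual AMP schema is not public.

```python
from cassandra.cluster import Cluster

# Hypothetical schema, for illustration only:
# CREATE TABLE amp.assets_by_type_bucket (
#     asset_type  text,
#     time_bucket text,       -- e.g. day of asset creation, "2023-01-15"
#     asset_id    timeuuid,   -- sortable, so it doubles as a pagination cursor
#     payload     text,
#     PRIMARY KEY ((asset_type, time_bucket), asset_id)
# ) WITH CLUSTERING ORDER BY (asset_id DESC);

PAGE_SIZE = 100

def fetch_asset_id_pages(session, asset_type, time_bucket):
    """Yield pages of asset ids, using the last timeuuid seen as the cursor."""
    last_id = None
    while True:
        if last_id is None:
            rows = session.execute(
                "SELECT asset_id FROM assets_by_type_bucket "
                "WHERE asset_type = %s AND time_bucket = %s LIMIT %s",
                (asset_type, time_bucket, PAGE_SIZE),
            )
        else:
            # Next page: everything strictly older than the last id fetched.
            rows = session.execute(
                "SELECT asset_id FROM assets_by_type_bucket "
                "WHERE asset_type = %s AND time_bucket = %s "
                "AND asset_id < %s LIMIT %s",
                (asset_type, time_bucket, last_id, PAGE_SIZE),
            )
        page = [row.asset_id for row in rows]
        if not page:
            return
        yield page
        last_id = page[-1]

session = Cluster(["127.0.0.1"]).connect("amp")
for page in fetch_asset_id_pages(session, "video", "2023-01-15"):
    print(len(page), "asset ids in this page")
```

Because the partition key is the (asset type, time bucket) pair, every page read stays within a single partition, and the timeuuid clustering column gives a stable sort order for the `asset_id <` cursor condition.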
In some cases, they had to reprocess a specific set of assets based on a field in the asset payload. Instead of reading assets from Cassandra by time or asset type and then filtering them further, they searched for those assets in Elasticsearch, which was more performant.
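A sketch of such a lookup, assuming the Elasticsearch Python client (8.x) and a hypothetical `assets` index with the payload fields indexed, might look like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical index and field names -- the real AMP index layout is not public.
# Find assets whose payload matches a specific field value, instead of scanning
# whole time buckets in Cassandra and filtering client-side.
resp = es.search(
    index="assets",
    query={"term": {"payload.codec": "h264"}},
    size=1000,
)

asset_ids = [hit["_source"]["asset_id"] for hit in resp["hits"]["hits"]]
```

For result sets larger than a single response, the same query would be paged with Elasticsearch's scroll or search_after mechanisms rather than a fixed `size`.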
After the asset ids were read, an event was created per asset id and processed either synchronously or asynchronously, depending on the use case. For asynchronous processing, the events were sent to Apache Kafka topics.
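For the asynchronous path, producing one event per asset id could look roughly like the following, using the kafka-python client; the topic name and event shape are assumptions for illustration.

```python
import json
from kafka import KafkaProducer

# Hypothetical broker address and topic name, for illustration only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_reprocessing_events(asset_ids, use_case):
    """Create one event per asset id and hand it to Kafka for async processing."""
    for asset_id in asset_ids:
        producer.send(
            "asset-reprocessing-events",
            value={"asset_id": str(asset_id), "use_case": use_case},
        )
    producer.flush()  # block until all queued events have been sent
```

Decoupling the read side (paginated Cassandra or Elasticsearch queries) from the processing side through a Kafka topic is what lets reprocessing run in parallel with live production traffic.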
The use of Apache Cassandra in Netflix’s data reprocessing pipeline demonstrates the database’s flexibility and scalability in handling large volumes of data. By designing its data tables thoughtfully, Netflix was able to read and update existing asset metadata efficiently, letting the platform evolve and adapt to new requirements without any service downtime. This use case underscores the power of Apache Cassandra for managing and processing big data in real-world applications.