Evolving Schemaless into a Distributed SQL Database

5/19/2022

Reading time:8

This resource is based on an article originally published here.

In 2016 we published blog posts (I, II) about Schemaless – Uber Engineering’s Scalable Datastore. We went over the design of Schemaless as well as explained the reasoning behind developing it. In this post today we are going to talk about the evolution of Schemaless into a general-purpose transactional database called Docstore.

Docstore is a general-purpose multi-model database that provides a strict serializability consistency model on a partition level and can scale horizontally to serve high volume workloads. Features such as Transaction, Materialized View, Associations, and Change Data Capture combined with modeling flexibility and rich query support, significantly improve developer productivity, and reduce the time to market for new applications at Uber.

Docstore is currently in production and is serving business-critical use cases.

Schemaless was originally designed as an append-only datastore. The smallest entity is called a cell and it is immutable. Removing mutability reduced the complexity of the system and made it less error prone. However, over time we realized that the restrictive API and modeling capabilities made it hard for users to use as a general-purpose database.

The shortcomings of Schemaless resulted in the introduction of Cassandra, which did offer a lot of flexibility and ease of use. However, Cassandra had other shortcomings. Uber has a significant data footprint and at that scale, scalability and efficiency must go hand in hand. We found Cassandra to be lacking operational maturity at Uber’s scale and at the same time it did not provide the desired level of efficiency. The eventual consistency offered by Cassandra also ended up impeding developer productivity as they had to design around the lack of strong consistency and this complicated the application architecture.

Having first-hand experience developing and operating Schemaless and Cassandra, we came to the conclusion that evolving Schemaless into a general-purpose transactional database would be the best choice. Schemaless had traditionally been a highly reliable system but now we needed to focus on usability while achieving similar or better reliability!

We did not want to build a NoSQL system. We wanted to provide the best of both worlds: schema flexibility of document models, and schema enforcement seen in traditional relational models.

We have designed tables in Docstore to enforce a schema on the data. The applications that work with the data usually assume some kind of structure. This means either they utilize schema-on-read in which case the application interprets the data when it is read or they utilize schema-on-write, which ensures that schema is explicit and the database ensures schema conformity of data. We support the latter approach of “schema-on-write” by default.

Together with the schema enforcement described above, Docstore also provides schema flexibility. Schemas can evolve. Records with different schemas are allowed to co-exist and schema updates do not require full table rebuilds. Sparseness and support for complex nested data types are first-class features.

Docstore has the following features built-in out of the box. It integrates with the Uber software ecosystem and can be provisioned with the single-click of a button.

Docstore has a layered architecture and a Docstore deployment is called an instance. Each instance is divided into a query engine layer, a storage engine layer, and a control plane.

Figure 2 – Docstore layered architecture

The query layer is stateless and is responsible for routing the requests to the storage layer.

The storage engine, responsible for storing the data, is organized as a set of partitions, and data is distributed across these partitions. The control plane is responsible for assigning shards to Docstore partitions and adaptively adjusts the shard placement in response to failure events.

Docstore has a concept of tables. Tables look like relational database tables in that they are structured with rows, columns, and values. There are no restrictions on how tables are modeled in Docstore and with the use of user defined types, Docstore can store nested records as rows. This is useful, for example, when the data has a document-like structure where the entire hierarchy is loaded at once. Docstore also provides support for “Associations”, which allows for representation of one-to-many and many-to-many relationships. We call this the Flexible Document Model as it allows modeling both relational and hierarchical data models. We will cover data modeling in Docstore in the second part of this blog series.

Each table can have one or more materialized views. A materialized view is a view that allows the data to be partitioned differently as compared to the main table, by using different columns. Adding a materialized view partitioned by a non-primary key column makes it efficient to look up data by that column and allows different query access patterns.

Every table must have a primary key and that primary key can be composed of one or more columns. The primary key identifies the row in the table and enforces a unique constraint. Internally, the primary key and partition key columns are stored as a byte array and the value is derived by doing the order-preserving encoding of the key column values. Docstore stores rows in sorted order by primary key values. This combined with composite partition keys enables sophisticated query patterns, including grabbing all rows with a given partition key or using the remainder of the primary key to narrow the relevant rows for a particular query.

Naturally, for us, the next step in our design process was to deal with implementing the sharding logic. Tables are sharded and distributed across multiple shards – transparently to the application. Each shard, representing a set of rows from a table on the order of a few hundred gigabytes, is assigned in its entirety to a partition. A partition can contain one or more shards.

The important design consideration was to let applications control the data locality through their choices of keys. That’s how we introduced partition keys in addition to primary keys. An application can choose to explicitly define partition keys in the schema otherwise Docstore uses primary keys to shard the data.

Typically, there are multiple partitions in each Docstore instance. To eliminate a single point of failure, a partition is a group of 3-5 nodes and each node is a unit of physical isolation and is deployed to an independent zone. Each partition is replicated to multiple geographic locations to provide resilience to data center failures.

Replicated state machine

To ensure consistency, each partition runs the raft consensus protocol. There is a single leader and multiple followers.

All writes are initiated by the leader. The consensus protocol implemented is used to keep the replicated log consistent across the nodes in the partition. This guarantees that all the nodes in a partition contain the same writes in the same order ensuring serializability. The state machine running on each node only proceeds in committing the write if and only if consensus is reached. This gives a very nice property that if a write to a key is successfully committed, all the subsequent reads via the same key return the same data written by this particular operation or some later operation.

Consistency Model

Docstore provides a strict serializability consistency model on a partition level. This provides a nice property to the user where they can imagine that transactions are getting executed sequentially. Transactions are ordered such that a transaction ‘A’ started and committed before transaction ‘B’, always occurs before B. This ensures that a read operation always returns the result from the most recent write. From a CAP theorem terminology, Docstore favors consistency over availability and therefore is a CP system.

Transactions

Docstore uses MySQL as the underlying database engine. The unit of replication in the Replicated State Machine is a MySQL transaction. All the operations are executed within the context of a MySQL transaction to guarantee ACID semantics. These transactions are then replicated across the nodes using the raft consensus protocol.

Figure 6 – Sequence of operations in a transaction

We depend on MySQL for concurrency control. It is important to realize that MySQL relies on row locks for concurrency control of write operations (insert, update, deletes). Thus, the concurrent updates to the same rows are effectively serialized by MySQL, and by the time the control flow reaches the client issuing the commit, all the locking has been taken care of.

We use the flowchart in figure 7 to show what happens when two transactions interleave in time. We use the boxes at the different positions on the time axis to express the interleaving, i.e. different boxes correspond to different ‘events’ in time.

A raft based replicated state machine implementation allows for exposing MySQL’s transaction to clients in a highly-available way, i.e. all the replicas coordinate with each other to apply the transactions so that automatic failover among replicas is possible while the ACID property of transactions are still maintained even if failover happens. Note that because we rely on exposing MySQL’s transactions to clients, all of the benefits and the constraints of MySQL transactions are inherited.

In this article, we have covered the genesis and the motivation behind Docstore. We have also dived deep into the architecture and explained how transactions are handled in Docstore. In the next part of this multi-part series, we will focus on data modeling and schema management. We will cover how Docstore can support hierarchical and relational models and what type of applications should choose these data models. In the third and concluding part of this series we will do a detailed drilldown of Materialized Views in Docstore. This includes the motivation, the MV refresh framework and how we plan to leverage MVs even if it is not explicitly mentioned in the query.

If you enjoyed this article and want to work on systems and challenges related to globally distributed systems that power the operational storage of Uber, please explore and apply to open positions in the storage team on Uber’s career portal.

Comments

Related Articles

cassandra

langchain

llamaindex

GitHub - michelderu/chat-with-your-data-in-cassandra: Chat with your data stored in DataStax Enterprise, Astra DB and Apache Cassandra - In Natural Language!

3/26/2024

api.management

snowflake

oracle

GitHub - dreamfactorysoftware/dreamfactory: DreamFactory API Management Platform

3/7/2024

python

java

cassandra

Vald

2/11/2024

mongo

code.generation

sqlite

GitHub - loopbackio/loopback-next: LoopBack makes it easy to build modern API applications that require complex integrations.

1/26/2024

integration

ignite

cassandra

Ignite Cassandra Integration Usage Examples

1/20/2024

hive

elasticsearch

cassandra

GitHub - embulk/embulk: Embulk: Pluggable Bulk Data Loader.

12/1/2023

hive

firebird

oracle

DBeaver Community | Free Universal Database Tool

5/18/2023

mongo

cassandra

artificial.intelligence

Machine Learning In Your Database Using SQL

11/10/2022

yugabyte

cassandra

postgres

A Distributed Cloud Native Database for a Highly Scalable Data Store

8/28/2022

schemaless

uber

cassandra

System Design of Uber App - Uber System Architecture - GeeksforGeeks

5/19/2022

Explore Further

mysql

api.management

snowflake

oracle

GitHub - dreamfactorysoftware/dreamfactory: DreamFactory API Management Platform

3/7/2024

python

java

cassandra

Vald

2/11/2024

mongo

code.generation

sqlite

GitHub - loopbackio/loopback-next: LoopBack makes it easy to build modern API applications that require complex integrations.

1/26/2024

hive

elasticsearch

cassandra

GitHub - embulk/embulk: Embulk: Pluggable Bulk Data Loader.

12/1/2023

schemaless

uber

cassandra

System Design of Uber App - Uber System Architecture - GeeksforGeeks

5/19/2022

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

flink

beam

dataflow

• Google Dataflow - Awesome-Astra

5/10/2023

uber

schemaless

uber

cassandra

System Design of Uber App - Uber System Architecture - GeeksforGeeks

5/19/2022

mesos

uber

cassandra

How Uber Manages a Million Writes Per Second Using Mesos and Cassandra Across Multiple Datacenters - High Scalability -

5/19/2022

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

Replicated state machine

Consistency Model

Transactions

Become part of our

growing community!

Planet Cassandra is a service for the Apache Cassandra® user community to share with each other. From tutorials and guides, to discussions and updates, we're here to help you get the most out of Cassandra. Connect with us and become part of our growing community today.

Get Involved with Planet Cassandra!

We believe that the power of the Planet Cassandra community lies in the contributions of its members. Do you have content, articles, videos, or use cases you want to share with the world?