July 9th, 2014

I have grown increasingly frustrated with the world as people have become more and more convinced that “schema-less” is actually a feature to be proud of (or even exists). For over ten years I’ve worked with close to a dozen different databases in production and have not once seen “schemaless” truly manifest. What’s extremely frustrating is seeing this from vendors, who should really know better. At best, we should be using the description “provides little to no help in enforcing a schema” or “you’re on your own, good luck.”

Before we dig any further into this topic, I think it’s important to understand how we got here. Let’s go back 5-10 years. The default database for a majority of projects was typically a RDBMS. MySQL is all the rage and people can’t get enough PHP. ORMs like hibernate made it easier to work with a RDBMS without actually treating it like one. Everyone is sort of happy, but the tools really aren’t all that great and mostly do a poor job of abstracting the DB away, or require XML files or some other nonsense. In the end, the ORM could only take away some of the annoyance, mostly manually written SQL queries. You still had to deal with the other (perceived) downsides of the RDBMS:

  • It gets slow to add columns in the most popular implementations as your data set grows.
  • Joins don’t perform well with large datasets.
  • People wanted to think in terms of objects, and see having a schema as pointless overhead.
  • SQL is perceived as hard
  • If you’re interested in OLTP workloads, the RDBMS requires sharding and manual management
  • Once you shard you lose out on most the fun SQL features

Looking at each of these limitations, the classic “lets just throw everything out and start from scratch” mentality kicked in. This seems to be an effort to solve the root problem, but the problem is misdiagnosed. The decision to eliminate schema from the database because there are some problems with the implementations of schemas is total nonsense.

Slow alterations is sometimes hailed as a reason to go schema-less. Logically though, there is no reason why changing your schema inherently must result in downtime or take hours. The only reason why this is the case with the RDBMS world is because it generally requires a rewrite of the db files to include the new column. This is an implementation detail, and a choice, not a requirement. Cassandra, for instance, uses a schema, and can perform CQL table alterations without rewriting the entire table, as does sqlite. So this reason is a crock and can be dismissed as an excuse to not support a typed schema.

Joins don’t perform spectacularly across machines due to a number of reasons. For OLTP workloads the common advice is to denormalize your data. Copying your data once incurs a penalty on writes but returns many fold on reads. For OLAP workloads, the industry has been doing a lot of map reduce, but this is most likely a means to an end. Writing map reduce requires too much developer work and will mostly be replaced by tools that either utilize a M/R framework or something better. The current best candidate for this is SparkSQL, which integrates well with existing data stores, and uses a DAG (directed acyclic graph) instead of map/reduce. I’m not sure how you could use a tool like Spark and gleam any useful information out of a database without any sort of defined structure somewhere. I think this is a clear case where explicit schema is highly beneficial, so I’ll add this to my list of “features that don’t need to be thrown out just because they’re associated w/ an RDBMS.”

The third point, “schema as pointless overhead” is nonsense. In any non trivial application, there is going to be a defined data model. The purpose of a schema is not for the developer writing an MVP by himself, it’s for the team that has to deal with his code for the foreseeable future. Schema is important, it’s the only way that everyone that didn’t write the original code can sanely understand the data model. Providing a central location (the database!) to look at this schema, and a human parsable / writable query language to examine it is undeniably useful. The overhead with any non-trivial application will quickly shift to the developers maintaining the codebase and away from managing the schema. Tools like Alembic make it straightforward to manage RDBMS schema changes, as well as the sync_table functionality in cqlengine, for Cassandra. The overhead of managing schema at this point is significantly less than the mental overhead your dev team must exert. Not maintaining your schema in a database becomes a form of technical debt.

SQL can be confusing for someone new to the language, but that’s hardly an excuse to throw either it or explicit schema out. SQL was designed to be human readable and writable, and a solid understand of it wields its user incredibly power and flexibility. Yes, you can write some really nasty queries, but there’s no reason to use a dull knife all the time just because you are afraid someone might cut themselves. This is strongly supported if you consider the alternatives so far. As a replacement for SQL, query languages based on JSON are total crap. They discourage you to actually look at your database directly, requiring code to compose queries frequently based on nested data structures. At a glance, it’s easy enough to compose simple queries for a handful of rows. As the feature set grows however, you end up with something that’s completely unmanageable and a nightmare to edit. Even if we decided to ditch SQL forever in favor of some other query language, it STILL does not force our hand to even consider removing explicit schema.

Manual sharding with an RDBMS is a huge pain. Having to expand your production database cluster via home grown migrations is at best an annoyance and at worst a massive liability. This certainly has contributed to the decline of the RDBMS, but there’s no reason why this should lead to a wild west-esque data model. Like the other excuses above, it has nothing to do with the case against schema.

So, for the above reasons, there is a stigma against the RDBMS, and with it, a rejection of explicit database schema. How the two got to be tangled so close together is a mystery – in any of the above points is there a solid technical reason to ditch schema? Not one.

If there isn’t a technical reason, there must be an advantage for a development team. Without schema we’re free to… do what exactly? Create a new class per record? That sounds tedious after a handful of classes, and impossible with millions. Put heterogeneous objects in the same container? That’s useful, but nothing new. Sqlalchemy, a Python ORM has had this for a while now, so has cqlengine.

We don’t seem to gain much in terms of database flexibility. Is our application more flexible? I don’t think so. Even without our schema explicitly defined in our database, it’s there… somewhere. You simply have to search through hundreds of thousands of lines to find all the little bits of it. It has the potential to be in several places, making it harder to properly identify. The reality of these codebases is that they are error prone and rarely lack the necessary documentation. This problem is magnified when there are multiple codebases talking to the same database. This is not an uncommon practice for reporting or analytical purposes.

Finally, all this “flexibility” rears its head in the same way that PHP and Javascript’s “neat” weak typing stabs you right in the face. There are some somethings you can be cavalier about, and some things you should be strict about. Your data model is one you absolutely need to be strict on. If a field should store an int, it should store nothing else. Not a string, not a picture of a horse, but an integer. It’s nice to know that I have my database doing type checking for me and I can expect a field to be the same type across all records.

All this leads us to an undeniable fact: There is always a schema. Wearing “I don’t do schema” as a badge of honor is a complete joke and encourages a terrible development practice.

The Myth of Schema-less was created by Jon Haddad, Apache Cassandra Evangelist at DataStax.