We primarily use Cassandra as a file index for B2, but over time we’ve also ended up using it for various file and account metadata, B1 groups, and even an in-index data store as a performance optimization. It’s grown over time, and now we have over 450 hosts across 7 clusters.
Founded in 2007, Backblaze, based in San Mateo, CA, is a cloud storage company with more than an exabyte of data under management and customers in over 175 countries.
The Backblaze Storage Cloud provides a foundation for businesses, developers, IT professionals, and individuals to store, backup, archive data, and, in partnership with other leading technology companies, host content, manage media, build their applications, and more.
The company started life as a computer backup company run from a one-bedroom apartment with a clear mission to make storing and using data astonishingly easy. Backblaze was a popular backup service in niches such as photography and genealogy in its early days. Its focus on keeping backup simple and easy to use has broadened its appeal over time.
Since 2013, Backblaze has regularly published hard drive performance statistics and insights — the only player in the market to do so. This transparency resonated with existing and prospective users and led to a following among developers and IT professionals that wanted to perform more complicated tasks beyond what Backblaze currently offered. With its Backblaze Computer Backup product firmly established, the company decided to develop Backblaze B2 Cloud Storage, a cloud storage platform providing infrastructure as a service (Iaas).
This IaaS would provide direct API access and enable developers and IT people to store in the cloud, retrieve and/or share data and scale up and down while only paying for what they used. Essentially, Backblaze B2 offers cloud storage similar to Amazon S3, Microsoft Azure Storage, and Google Cloud Storage that is easy to use, affordable, and trusted.
However, the existing Computer Backup stack wasn’t suitable for this new use case. It was a custom store, where the company controlled the client as well. “The Computer Backup index was very custom to how the Computer Backup system worked, so we had to start from scratch with Backblaze B2 and create a different system,” says Elliott Sims, Senior Systems Administrator at Backblaze.
It was at this point that Backblaze investigated distributed databases, and Apache Cassandra was introduced.
“We needed something that would handle really high write throughput and keep scaling on the write throughput. That forced us to look at distributed stores, and Apache Cassandra was the option that fitted what we needed.”
— Elliott Sims, Senior Systems Administrator
For the launch of Backblaze B2 in 2015, the company adopted Ansible for software provisioning, configuration management, and application deployment and paired it with Apache Cassandra for its data needs, applying templates and Reaper for handling repairs. In selecting Apache Cassandra, there were several vital benefits that made it the preferred choice for the new cloud solution:
Scalable and elastic performance that delivers
Backblaze needed to see an increase in write throughput as it added more hosts. “If you need just read throughput, traditional MySQL replication works great as you can keep stamping out replicas, but we needed write throughput as well. Cassandra was able to do that out of the box as opposed to using layering or our own sharding on top of MySQL,” says Sims. Backblaze has also found the write throughput and scaling with Cassandra have been very nearly linear. “30 machines does mean having three times the capacity of 10 machines with Apache Cassandra, or, you know, 2.95 times.”
Durability that withstands a data center going down
While write throughput was a key concern, Backblaze needed some standard database requirements as well: “When we put a piece of data in, it needed to stay there if the machine crashes. It can’t just lose a chunk of data, like some of the NoSQL databases are prone to do. It had to write it to disk and not be a cache that periodically snapshots, for instance,” says Sims.
Fault tolerance without downtimes
Like any company that values the sleep of its database operators, Backblaze wanted the ability to lose a machine overnight, “we didn’t want to have to page someone to scramble and fix it. We needed a solution that would withstand and ignore any single machine going offline.” As Apache Cassandra is decentralized and all data is automatically replicated to multiple nodes, there is no single point of failure
The benefits of Apache Cassandra 4.0 for Backblaze
Jumping to 4.0 is not an immediate consideration for Backblaze, “The way we use Apache Cassandra is as an authoritative source of truth for where user files are as opposed to analytics, and so on, so we are a little wary,” says Sims. But there are a number of features that Backblaze is excited about exploiting, and they anticipate a move with the release of 4.0.1 or later.
Incremental repairs to eliminate a tombstone issue
“The biggest one is incremental backups in conjunction with the leveled compaction strategy,” says Sims. Backblaze uses several small tables where tombstones (deleted data that hasn’t been purged from the disk) have been a problem. With SizeTieredCompactionStrategy (STCS), you can set Cassandra only to compact away tombstones that have been repaired. “This means you can set a really short gc_grace_seconds, which means the tombstones are cleaned up a lot more frequently on a small table that you might use […] They are a very small percentage of the total, so if we can run Incremental Repair on them multiple times a day those tombstones would be very short-lived – ten hours instead of ten days. It will essentially eliminate the problem.”
Easier data migrations with token distribution change
When performing a data center migration, Cassandra 4.0 will enable users to distribute tokens based on replication level instead of schema, which eliminates the chicken and egg dependency “Being able to distribute tokens based on specific schema is nice for balance,” says Sims. “But the new method cuts out a finicky process of manually assigning tokens for the new data center […] that will make my life easier for the remaining data center migrations and reduce the vnode count.”
Kubernetes for Cassandra
While Backblaze has a test environment for Kubernetes running right now, but Sims says they are moving more to the environment as Backblaze adds miscellaneous services to Backblaze B2. “I think we’ll end up moving to Kubernetes for Cassandra eventually, but at the moment we’re using bare metal controlled by Ansible […] Reaper has been easy to manage for the most part, and as we bring more clusters online, we’ll get more interested in the Reaper component of the cass-operator.”
While Sims says Apache Cassandra has various complexities and pain points, he sees it as delivering on its promises for Backblaze: “The promises of scalable write throughputs and the resiliency — where it keeps your data when it says it will — have been kept with Apache Cassandra. These are important things, and […] keeping your data is unfortunately not something I can say for some of the other new storage systems.”
Backblaze makes managing data astonishingly easy for businesses and consumers. The Backblaze Storage Cloud provides a foundational platform for a broad community of developers, IT generalists, entrepreneurs, and individuals who seek easy, affordable, trusted solutions. With more than an exabyte of data under management, the company currently works with customers in over 175 countries. Founded in 2007, the company is based in San Mateo, CA. For more information, please go to www.backblaze.com.