Richard Lowe: Co-founder and Principal Engineer at Arkivum
TL;DR: Arkivum provides an archive service aimed at storing data for the long term, with typical customers driven by legal or regulatory requirements. Data in Arkivum’s service is encrypted and is safe, secure, and simple to access, with a unique, 100-percent guarantee against data loss.
Arkivum started out with an SQL database. As they developed the technology and began turning it into a product, it became clear that they had reached the limits of what they could do with a traditional database. When evaluating NoSQL options, comparing Cassandra, MongoDB, and CouchDB, Cassandra came out on top because it is a hybrid of Dynamo and Bigtable, the support from DataStax and the community was good, and it was simple to get started with.
To support their 100% guarantee, Arkivum needed a database like Cassandra with no single point of failure, which is one of the key aspects they rely on to mitigate the risk of data loss. Arkivum uses Cassandra as their main database to store file system information as well as information about data integrity and the location of data in their system.
Hello, Planet Cassandra, this is Brady Gentile, Community Manager at DataStax. Today we have Richard Lowe; he is Principal Engineer at Arkivum. Richard, thanks so much for joining us today.
To get things started off, would you be able to tell us a little bit about what Arkivum does?
We provide an archive service aimed at storing data for the long term. By long-term we mean anything from five to ten years to several decades, or even forever. The reasons people archive data with us are usually driven either by compliance requirements, which might be legal or regulatory, or by the inherent value of the assets they have.
Data in our service is encrypted, verified, and stored using open standards, so there’s no vendor lock-in, and our customers’ data is safe, secure, and simple to access, with a unique, 100-percent guarantee against data loss.
How are you using Apache Cassandra?
One way that we achieve data safety in our archive is to store multiple copies of customer data across disk and tape. However, that data is as good as lost if we don’t have a reliable index of where all the copies are, how they fit into the file system, and whether the data is still intact.
We use Cassandra as our main database to store file system information as well as information about data integrity and the location of our data in the system. Cassandra’s replication capability means that we have no single point of failure for our database, which is one of the key aspects we use to mitigate the risk of data loss in our system.
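As a rough illustration of the kind of model described above (a hedged sketch only; the table and column names are hypothetical, not Arkivum’s actual schema), file and copy-location metadata could be kept in CQL along these lines:

    -- Hypothetical sketch: names and columns are illustrative, not Arkivum's schema.
    CREATE TABLE archive.file_copies (
        file_id      uuid,        -- identifier for the archived file
        copy_id      uuid,        -- one row per physical copy (disk or tape)
        medium       text,        -- 'disk' or 'tape'
        location     text,        -- data centre, tape barcode, or path
        checksum     text,        -- fixity value used to verify integrity
        last_checked timestamp,   -- when this copy was last verified
        PRIMARY KEY (file_id, copy_id)
    );

Partitioning by file keeps every known copy of a file together, so an integrity check can read all copy locations and checksums for that file with a single query.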
What was your motivation for choosing Cassandra; were there any other technologies that you had initially evaluated it against or maybe something that you switched from?
The technology that’s at the core of our service actually started off as a research project, and back then, when we started out, we were using an SQL database. As we developed it and started turning it into a product, it was clear that we’d seen the limits of what we could do with a traditional database.
Our customers need to be able to store millions of files with us, per customer, so we need to try to make sure our system can scale to cope with that. We also wanted to have our file system replicated and available across multiple nodes, which is a whole lot easier to do with Cassandra than it is with a traditional relational database.
When we were evaluating NoSQL options, Cassandra came out on top because it’s a hybrid of Dynamo and Bigtable, the support from DataStax and the community was good, and it was simple to get started with, which I didn’t find with things like Hadoop. We also briefly looked at CouchDB and Mongo, but we didn’t find them to be as capable and straightforward to use as Cassandra.
It’s good that you’ve had a great experience with it so far and that it fits your needs well. Can you share some insight into what your deployment looks like right now?
The data for our customers is stored in UK data centers, using disk and tape infrastructure that we own. We have multiple copies of customer data in multiple UK data centers, which is really the basis of our archive service. As well as the data that we store in our data centers, we also have what we call appliances on each customer’s site, and those appliances are also part of the network of nodes, so a customer has an appliance on their network to allow local access to the data in the archive.
If they have multiple sites, then we supply multiple appliances to them, one or more per site. These can be physical or virtual, depending on the requirements of the customer and the existing infrastructure they have.
Some of our customers operate in the cloud, so we have appliances deployed in services such as Amazon EC2 and IBM’s SmartCloud, as well as with vertical-specific providers for life sciences, education, and banking, so it’s really a mix of everything, which I think makes Arkivum pretty unique, from what I’ve seen of what people are doing with Cassandra.
We cross organizational networks as well as different cloud services, so WAN traffic is a major concern for us. Thankfully, we’ve been able to tune Cassandra to work across the WAN and to deal with challenges such as the slow and unreliable networks that you get when working across different organizations and infrastructures.
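For illustration only, WAN-oriented tuning of this sort typically involves standard cassandra.yaml settings such as the following; the values shown are examples, not Arkivum’s actual configuration:

    # Example cassandra.yaml settings for cross-data-centre / WAN deployments.
    # Values are illustrative, not Arkivum's configuration.
    endpoint_snitch: GossipingPropertyFileSnitch    # rack- and DC-aware topology
    internode_compression: dc                       # compress only inter-DC traffic
    phi_convict_threshold: 12                       # tolerate flaky links before marking nodes down
    request_timeout_in_ms: 20000                    # allow for higher cross-site latency
    stream_throughput_outbound_megabits_per_sec: 100  # cap streaming over the WAN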
Another major factor in our deployment is security, which is a big concern for our customers. Being able to separate customer data and metadata is very important to us, and Cassandra allows us to do this in a way that scales and provides resilience, without having to put everything in the same pot.
We actually use the topology-based replication structure to do this. In Cassandra terms, each customer site acts as a separate data center, so we can control exactly what data goes where, and our security implementation lets the nodes communicate for things like gossip but blocks attempts to send or receive data from the wrong place. Each customer has different keyspaces that only that customer’s nodes can access. I guess it’s a fairly complex deployment, and definitely a heterogeneous one, but it’s what works for us, and Cassandra was able to support us in that.
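A minimal sketch of what a per-customer keyspace might look like under that topology, assuming hypothetical data-centre names (CUSTOMER_A_SITE for the customer’s appliance site, ARKIVUM_DC1 and ARKIVUM_DC2 for Arkivum’s own data centres):

    -- Hypothetical: data-centre names and replication factors are illustrative.
    -- Each customer site is modelled as its own Cassandra data centre, so this
    -- keyspace is only replicated to that customer's appliances and Arkivum's DCs.
    CREATE KEYSPACE customer_a
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'CUSTOMER_A_SITE': 1,
            'ARKIVUM_DC1': 1,
            'ARKIVUM_DC2': 1
        };

Combined with per-keyspace access control, this kind of layout keeps one customer’s metadata from ever being replicated to another customer’s site.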
What has your experience been with the Apache Cassandra community, whether the virtual community on IRC and the mailing lists or the physical community at meetups and other community events?
I would say that the Cassandra community has been instrumental in helping us understand how best to deploy Cassandra and in overcoming some of the challenges we’ve faced. I haven’t really had the opportunity to interact with it as much as I would like, especially on the mailing list.
We’ve attended a few of the UK-based meetups, and I gave a talk at the Cassandra EU event in London last year, which the guys at Acunu organized. I’ve tried to do my bit, but it’s a difficult balance with the amount of day-to-day work as well.