March 24th, 2014

By Relay42

 

“The large volume of events naturally required us to look for alternatives to a traditional RDBMS from the start.”

- Tomas Salfischberger , Managing Director, Co-founder at Relay42

 

 

 

Relay42

Relay42 is a Tag & Data Management Platform. With Tag Management we give marketers a platform to integrate third-party tags such as Google Analytics, with the flexibility to collect data for our data management offering. The data management platform collects visitor-level interaction data from the website as well as from all marketing channels and other external sources, resulting in 2.5 to 3 billion events stored each month.

What makes our platform unique is that we don’t just create reports with this data; we actually link it to personalized actions across different marketing channels such as email, banner advertisements, videos and even the client’s website.

 

Natural need for NoSQL

The large volume of events naturally required us to look for alternatives to a traditional RDBMS from the start. We tested several NoSQL-type solutions, mainly for scalability and raw performance, but resilience and crash recovery were also important factors for us. Cassandra (version 0.6 at the time) scored very well on scalability and ease of deployment, and it had a very solid architecture. We’ve seen this improve further in later versions, allowing us to store more data per node with each new release.

 

Distributed data

Our platform is quite widely distributed, currently using 13 datacenters across all continents, because latency to end users is very important to us. The bulk of the data consists of raw events, which we store in our main cluster on private hardware. This cluster is approaching 50 billion stored events and growing rapidly.

 

The reason for private hardware is that our workloads are largely IO-bound, and bare-metal hardware handles such workloads much better than cloud instances. Another reason is the nature of the data: it is non-personally identifiable information, but we still consider it privacy sensitive and thus don’t want to store it on shared systems or in the public cloud.

 

Words of wisdom

Don’t be afraid to store data multiple times. It might feel counterintuitive to keep large amounts of data around “just in case”, but we do exactly that. We store every raw event we have ever received and process the raw data into more meaningful information in later steps. By never deleting the raw data we can always go back and process it again. So if we develop a new feature that requires data we didn’t extract before, or if we decide that some new format might help performance, we run a large Hadoop job that goes back through the historical data and processes it all again.
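
As a rough illustration of the "keep the raw events, rebuild the derived data" idea, here is a minimal Python sketch. It is not Relay42's actual Hadoop job: the event fields, the RAW_EVENTS list and the reprocess helper are all hypothetical names made up for this example, and an in-memory list stands in for the Cassandra cluster.

```python
from collections import Counter

# Hypothetical raw events; the field names are illustrative only
# and do not reflect any real Relay42 schema.
RAW_EVENTS = [
    {"visitor": "v1", "channel": "email", "page": "/home"},
    {"visitor": "v2", "channel": "banner", "page": "/pricing"},
    {"visitor": "v1", "channel": "email", "page": "/pricing"},
]


def reprocess(raw_events, extract):
    """Rebuild a derived view from scratch by replaying every raw event.

    `extract` maps one raw event to the key we want to count.
    The raw events themselves are never modified or deleted.
    """
    view = Counter()
    for event in raw_events:
        view[extract(event)] += 1
    return view


# The derived view that was built originally: events per channel.
by_channel = reprocess(RAW_EVENTS, lambda e: e["channel"])

# A later feature needs events per page, which was never extracted before.
# Because the raw events were kept, the new view is just another replay.
by_page = reprocess(RAW_EVENTS, lambda e: e["page"])
```

The point of the sketch is that a new derived view only needs a new extraction function; nothing about the stored raw data has to change.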

 

The Cassandra community

We started back in the Cassandra 0.6 days, and at that time we of course ran into some bugs here and there. Back then Cassandra wasn’t as widespread as it is now, so it was a bit harder to find solutions to common problems. The Cassandra developers have always been very responsive to problems or suggestions, which was great. The current, much larger community is very open, helpful, and still growing, which is a very good indicator of the maturity of Cassandra as a product.
