Alain Rodriguez Data Architect & Team Lead at Teads

Teads is an innovative video advertising branding platform that offers both ad-serving and supply-side platform (SSP) features. Our goal is to connect publishers and advertisers by providing publishers with the outStream ad formats we invented (which serve an ad without requiring video content, for example inside text with the inRead format) and by empowering brands to deliver high-quality video campaigns in premium contexts through these formats. As an SSP, our platform allows publishers to optimize yield and keep full control over the ads served on their websites, whether they are classic campaigns served via VAST or programmatic campaigns served through real-time bidding.

I am the main Data Architect, in charge of storing our tracking data. This tracking data is then used to provide real-time statistics that help us decide which ad is the best one to broadcast, millions of times every day.

Apache Cassandra at Teads

We’re currently using Apache Cassandra 1.2.18, but we plan to migrate to 2.0.11 very soon, and then to DataStax Enterprise 4.6. After three years on our own, we finally chose to use DataStax support, software and managed services. We hope this DSE integration will help our work on stability and quality, which is now the main goal of our worldwide deployment.

We use Apache Cassandra in three distinct ways:
• We use a lot of counters to provide real-time statistics on the number of people exposed to any ad, website, and more.

• We store data from our big data stack. Cassandra is an endpoint for our Spark stack. We cross dimensions to make all the desired information available to our customers. We currently process our data into more than 100 simple and crossed dimensions.

• We store the data our algorithm needs to choose the best ad to display, following a specific set of rules.
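
To make the idea of simple and crossed dimensions more concrete, here is a minimal sketch in Python. It uses an in-memory Counter as a stand-in for Cassandra counter columns and is not Teads' actual schema or code: the dimension names, the hourly bucket and the counter_keys helper are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def counter_keys(event, dimensions, hour_bucket):
    """Yield one counter key per single dimension and per pair of dimensions."""
    # Simple dimensions: one counter per (dimension, value).
    for dim in dimensions:
        yield (hour_bucket, (dim,), (event[dim],))
    # Crossed dimensions: one counter per pair of dimensions and their values.
    for a, b in combinations(dimensions, 2):
        yield (hour_bucket, (a, b), (event[a], event[b]))


# In-memory stand-in for counter columns (assumption, for illustration only).
counters = Counter()
dimensions = ["ad", "website", "country"]  # hypothetical dimension names

events = [
    {"ad": "ad-1", "website": "site-a", "country": "FR"},
    {"ad": "ad-1", "website": "site-b", "country": "FR"},
]

for event in events:
    for key in counter_keys(event, dimensions, "2014-12-01T23"):
        counters[key] += 1

# Exposures of ad-1 during that hour, all websites combined.
print(counters[("2014-12-01T23", ("ad",), ("ad-1",))])
# Exposures of ad-1 in FR during that hour (a crossed dimension).
print(counters[("2014-12-01T23", ("ad", "country"), ("ad-1", "FR"))])
```

Each tracked event increments several counters at once, which is one reason a write-heavy store suits this kind of workload.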

Choosing Apache Cassandra & Deployment Details

In every piece of our infrastructure, we needed technologies that scale linearly and horizontally, are highly available (with no single point of failure), and allow for worldwide deployment.

Our goal was to have a database able to handle a high throughput with low latencies and without sacrificing consistency. We knew from the start that we were going to grow quickly and worldwide.

All these features already existed in Cassandra 0.8, the version we started with at the end of 2011. It was also the first release with distributed counters, a cool feature for our time-series counting metrics. We considered this choice “future-proof” at the time, and can confirm today that it was.

We liked Cassandra’s main characteristics:
• No single point of failure (we have SLAs, and any downtime is really expensive)

• Horizontal scaling (using AWS, this is very easy and efficient)

• Write efficiency (we track a lot, so our use case fits well)

• Presence of counters

• Peer-to-peer clustering, with no master/slave roles

Back in 2011 we had neither the money nor the time for real stress tests, so we did not run meaningful benchmarks to choose the right technology. I have to confess that this choice was made by feel, on trust, and after reading a lot on the Web. We finally chose Cassandra over HBase mainly because of our use case, which involves a lot of writes; more precisely, Writes >> Reads, which is a use case Cassandra is good at. Another reason had to do with Cassandra’s peer-to-peer system, which we liked more than the more classical master/slave system.

Deployment Details

In terms of numbers, we now use 30 i2.2xlarge EC2 nodes (Amazon AWS) across two clusters. Our main cluster spans two data centers (US-East and EU-West), and we plan to open a third one in Asia in a few weeks.

The other cluster handles the Big Data workload, since we use Cassandra as an endpoint for our Hadoop / Spark stacks. We chose to isolate this heavy (millions of writes per minute) and spiky workload to keep latency on the main cluster more predictable.

Monitoring: SPM & OpsCenter

As a very high-growth company, we wanted to spend our time focused on our core business, so we decided not to create custom tools. In 2011, DataStax OpsCenter was not free to use and my Cassandra knowledge was very poor. So I decided to use Cassandra Cluster Admin (https://github.com/sebgiroux/Cassandra-Cluster-Admin, no longer maintained) to handle my data and AWS CloudWatch for performance metrics. I discovered soon enough, however, that I needed to understand what was going on at the node level in terms of heap, GC, memory, latency, etc. A phpMyAdmin-like tool and AWS metrics were not enough to understand what was causing our issues.

When DataStax OpsCenter opened up to every Cassandra user for free, I gave it a try and stuck with it for a long time. OpsCenter delivered real improvements, since I was able to detect things like GC issues, heap pressure, latency spikes, and many other issues that I was then able to fix. SPM Performance Monitoring and Reporting from Sematext supplements OpsCenter in a few really great ways: alerts for non-DataStax Enterprise users, speed, a more intuitive UI, and preconfigured dashboards (which help new users see which metrics are important).

We finally decided to mix the best of SPM and OpsCenter (Enterprise).

OpsCenter Strengths

• Managed service that allows us to act. It is not only monitoring: it also offers a “ring” view and a “list” view with nodes represented as colored circles, which give a good overview of the data size, load and status of the nodes. It is also a cool thing to show to people who do not know Cassandra, since the UI is very intuitive and useful.

• Proper data center handling is very nice. With SPM, you have to create distinct apps for each new cluster and DC; OpsCenter offers aggregated views and a single app to access everything.

• Metrics are expressed smartly. For example, GC is expressed as “x ms/s”. Knowing what share of your time is lost to GC is more relevant than a sum of GC times, which is what SPM shows.
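
As a rough illustration of why a rate reads better than a running total, the sketch below converts two samples of a cumulative GC-time counter into “ms/s” and a percentage of wall-clock time. The function name and the sample values are hypothetical; this is not how OpsCenter or SPM compute their metrics internally.

```python
def gc_rate_ms_per_s(gc_total_ms_start, gc_total_ms_end, interval_s):
    """Return GC time spent per second of wall-clock time, in ms/s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (gc_total_ms_end - gc_total_ms_start) / interval_s


# Hypothetical samples of a cumulative GC-time counter, taken 60 s apart.
rate = gc_rate_ms_per_s(120_000, 123_000, 60)

# 1000 ms/s would mean 100% of the time is spent in GC, so divide by 10.
percent_of_time = rate / 10

print(f"{rate:.1f} ms/s, i.e. {percent_of_time:.1f}% of wall-clock time")
# prints "50.0 ms/s, i.e. 5.0% of wall-clock time"
```

The cumulative total (123,000 ms here) says little on its own, while 50 ms/s immediately tells you that the node spends about 5% of its time in GC.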

SPM Strengths

• Well-designed, preconfigured, very useful views (Overview, GC, etc.), and an interface which saves us some time. In OpsCenter, by contrast, I have to think up the dashboards and then configure them myself.

• Same tool for Cassandra and all our Big Data projects (YARN – Hadoop, Spark, Kafka, Storm, and more).

• Sematext engineers are very responsive and things we report get fixed quickly.  A chat tool is available for asking anything.

• And last but not least: having monitoring data on the exact same system that you are monitoring can be very bad. If your node goes down, you lose your monitoring at the moment you need it the most! By default, OpsCenter creates a keyspace on the Cassandra nodes, which is not replicated enough to provide HA and also uses part of the available resources for monitoring purposes.

Thanks to SPM we now have a very good view of what is happening in terms of Cassandra performance. The value of this kind of application is fully realized during outages or slowdowns. With every outage we’ve had since we started using SPM, I have found the root cause and been able to fix or mitigate things. Any downtime, even for just a few minutes, can lead to hundreds of thousands of dollars lost, plus a negative opinion of Teads among our customers and a negative impact on our image. Today it is really worth it to invest in a good monitoring solution. I believe SPM is one of them.

Getting Started Advice

My advice is based on my own operational experience, and operations are a very important aspect of using Cassandra: I think it is mandatory to understand a bit of Cassandra’s internals. You need to understand how things work under the hood to be efficient. Cassandra needs a good configuration, and this configuration depends significantly on your use case. You can’t just do things as other people do, because it won’t necessarily work well for you. So take the time to understand how Cassandra works, or you will regret it later.

To help you with this, I think that nowadays it makes sense to subscribe to DataStax Enterprise (DSE) and/or Sematext early, contrary to what we did, since DSE is now free for startups and Sematext also offers a program for startups, nonprofits and educational institutions with free and discounted versions of its products. Plus, the Sematext guys are very understanding and let you test the product as much as needed before subscribing to a plan of your choice.

I find that teams working with Cassandra are often very small (e.g., fewer than five people for companies like Spotify and Netflix; and I am the only full-time Data Architect at Teads), and I think that the support offered by DataStax and the monitoring tools can help you save some time and money, since you can find and fix issues accurately and efficiently. It is really worth a try since both solutions are free for testing purposes, and afterward the cost will be proportional to your usage, and affordable in most cases.

For those just starting, SPM’s preconfigured dashboards will help you since all the important metrics are exposed by default; you just have to install it (through an agent or not, that is up to you) and you will be good to go! Those metrics, correctly exposed, will improve your understanding of Cassandra’s internals as well.

Apache Cassandra Community

The Cassandra community might be one of my favorite things about Cassandra. The community is active, all the time, and ready to help through multiple channels (IRC, mailing lists, GitHub …).

Numbers can sometimes be more explicit than words: according to my Grokbase Cassandra user profile, I have sent 274 emails to ask or answer questions. I am among the top 10 users of the mailing list. I have almost always received answers to my questions, and I have helped a lot of people.

Well, as you may have understood, the community is at the center of my Cassandra usage, and I think it should be this way for any user.
