Alain Rodriguez, Lead Architect and Data Scientist at Teads
Teads is an innovative adserver. Our goals are to connect advertisers with publishers and to provide the technology for broadcasting video ads in an “outstream” way, meaning that no video content is needed to play a video ad.
I am the main data scientist and architect, in charge of storing our tracking data. That data feeds our real-time statistics and the algorithm that chooses the best ad to broadcast, millions of times every day.
Cassandra at Teads
We use Cassandra in 3 distinct ways:
- We use a lot of counters to provide real-time statistics on the number of people exposed to any ad or any website, and more.
- We store raw data so that we can (someday) provide more detailed statistics, crossing more dimensions, in batch mode using Hadoop.
- We store the data our algorithm needs to choose the best ad to display according to set rules.
We started using Cassandra 0.8.0 about 2 years ago. We upgraded to each major release and are now using Cassandra 1.2.11.
We liked Cassandra’s main characteristics:
- No single point of failure (We have SLAs, and any downtime is really expensive)
- Horizontal scaling (Using AWS, this is very easy and efficient)
- Write efficiency (We track a lot, so our use case fits well.)
- Presence of counters
- Peer to peer clustering, with no master/slaves.
We had no time to run benchmarks back then to help us choose the right technology, so we made the decision after reading a lot on the web. We chose Cassandra over HBase mainly because our use case involves a lot of writes.
We now have one DC in AWS eu-west, with 28 m1.xlarge nodes running Cassandra 1.2.11 and holding 300 GB of data each.
We use a replication factor of 3 and perform both reads and writes at consistency level QUORUM.
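To make those numbers concrete, here is a minimal Python sketch of the arithmetic behind this setup. It is only an illustration, not code we run in production: the helper names `quorum` and `overlaps` are made up for this example, and the storage estimate assumes all data lives in keyspaces with a replication factor of 3 and ignores compaction and snapshot overhead.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM request (a majority)."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when any read set shares at least one replica with any write set."""
    return read_replicas + write_replicas > replication_factor


RF = 3             # replication factor, from the text
NODES = 28         # cluster size, from the text
GB_PER_NODE = 300  # approximate data held per node, from the text

q = quorum(RF)
print(f"QUORUM at RF={RF}: {q} replicas must respond")
print(f"QUORUM reads overlap QUORUM writes: {overlaps(q, q, RF)}")
print(f"Replicas that can be down per request: {RF - q}")

# Rough storage estimate (assumption: everything is replicated 3 times).
total_gb = NODES * GB_PER_NODE
print(f"On disk: {total_gb} GB, roughly {total_gb // RF} GB before replication")
```

With a replication factor of 3, QUORUM means 2 replicas, so a read always touches at least one replica that acknowledged the latest write, and each request still succeeds with one replica down.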
We have already tested opening a new DC. The second DC will go live in a few weeks, and a third should follow.
Advice on getting started
For the operational part, which is very important when using Cassandra, I think it is mandatory to understand a bit of Cassandra's internals. You need to understand how things work under the hood to be efficient. Cassandra needs a good configuration, and this configuration depends heavily on your use case. You can’t just do what other people do; it won’t necessarily work well for you.
So take the time to understand how this beautiful tool works, or you will regret it later.
Apache Cassandra community
The Cassandra community might be one of my favorite things about Cassandra. The community is active all the time and ready to help through multiple channels (IRC, mailing lists, GitHub …).
Numbers can sometimes be more explicit than words: according to my Grokbase Cassandra user profile, I have sent 274 mails to ask or answer questions. I am among the top 10 users of the mailing list. I almost always got answers to my questions, and I have helped a lot of people.
Well, as you may have understood, the community is at the center of my Cassandra usage, and I think it should be this way for any user.