Brady Gentile: Community Manager at DataStax
Benjamin Hawkes-Lewis: Technical Architect at VisualDNA
Rafal Kwasny: Senior Systems Administrator at VisualDNA
Brady: Thanks for joining us today, Benjamin and Rafal; could you talk to us a little about what VisualDNA does?
VisualDNA: VisualDNA provides true online audience targeting and insight solutions to the world’s leading digital agencies, advertisers and publishers. We have 161 million active profiles every month and collect over 6 billion events monthly. We also provide audience analytics to publishers for free – have a look at http://why.visualdna.com
Brady: And how is Cassandra involved in the mix?
VisualDNA: Cassandra serves as our canonical storage for consolidated profile data — user details, logins, survey data, partner datapoints, modelled insights.
We also use Hadoop and Hive heavily and rely on Kafka streams for moving the data around. We are augmenting our existing batch processing with increasing amounts of near-real-time, event-driven processing. Cassandra fits this requirement nicely.
Brady: What was your motivation for using Cassandra and what other technologies was it evaluated against?
VisualDNA: In 2010 we needed to upgrade our platform to handle an increasing scale of data, as the existing MySQL-based platform wasn’t coping with the size and velocity of data.
At that time the platform consisted of:
- A series of bespoke online surveys talking to a web service sitting on a self-hosted LAMP stack.
- Behavioral inference running on EMR against web traffic events stored in S3 and pushing to our self-hosted MySQL databases.
Options evaluated for our new profile storage included MySQL, CloudDB, MongoDB, Redis, and Cassandra. We chose Cassandra because:
- We wanted to support extremely high write rates for permanent storage, and Cassandra was optimised for write availability and performance (with eventual consistency). Consistency still mattered to us because the survey history would be user-facing, so reads and rapid consistency would be provided by an intermediate caching layer (Memcache).
- We expected significant international expansion, initially into the US, and Cassandra was already shipping with support for replication across multiple datacentres.
- Cassandra performed best under our initial load tests.
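The split described above, with writes going to permanent storage and reads served from a cache, can be illustrated with a minimal sketch. It assumes a write-through pattern, which the interview does not confirm, and uses plain dictionaries as stand-ins for Cassandra and Memcache. The class name `ProfileStore` and all of its members are hypothetical.

```python
class ProfileStore:
    """Hypothetical write-through sketch, not VisualDNA's actual code."""

    def __init__(self):
        self.durable = {}       # stand-in for the permanent store (Cassandra)
        self.cache = {}         # stand-in for the caching layer (Memcache)
        self.durable_reads = 0  # counts reads that fell through to the store

    def write(self, user_id, answers):
        # Persist first, then refresh the cache so that readers see the
        # new survey history immediately.
        self.durable[user_id] = list(answers)
        self.cache[user_id] = list(answers)

    def read(self, user_id):
        # Serve from the cache when possible.
        if user_id in self.cache:
            return self.cache[user_id]
        # Cache miss: fall back to the durable store and repopulate.
        self.durable_reads += 1
        value = self.durable.get(user_id)
        if value is not None:
            self.cache[user_id] = value
        return value


store = ProfileStore()
store.write("user-1", ["q1:a", "q2:c"])
print(store.read("user-1"))   # served from the cache
print(store.durable_reads)    # no durable read was needed
```

In this sketch a user who has just submitted a survey reads their own history back from the cache, so they do not depend on how quickly the eventually consistent store converges.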
We like to run our infrastructure on bleeding-edge versions to keep up with the newest features our developers can incorporate into their projects. Cassandra works best for us as a cold on-disk datastore; we also keep aggregated data in memory using Redis.
Brady: Can you share some insight on what your deployment looks like?
VisualDNA: We have a 16-node cluster with over 200 spinning disks. With recent advancements like leveled compaction, nodes can store terabytes of data and still respond in milliseconds. Data volume is one of the reasons we chose Cassandra; some of our column families store over 5 billion rows. We spend a lot of time optimizing Java GC for our workload.
Brady: What’s your favorite part about Apache Cassandra?
VisualDNA: The key to running large clusters used by real-time applications is maintainability: we can turn off nodes, commission new ones, or decommission old ones without stopping the whole cluster.
Also, the speed with which the codebase evolves and matures is phenomenal – thanks to the community maintaining and developing it.
Brady: What would you like to see out of Apache Cassandra in future versions?
VisualDNA: We would welcome performance improvements, especially for random reads on large datasets.
Brady: What’s your experience with the Apache Cassandra community?
VisualDNA: The London Cassandra Users Group is well organized, thanks to Dave Gardner, who manages it. I would also recommend joining #Cassandra on Freenode; currently there are just over 200 people on the channel, but I think C* has far more users.
Brady: Anything else that you’d like to add?
VisualDNA: If you’re in London (or want to work in London) and Cassandra is your thing, speak to us – we’re always hiring.