Victor Anjos, Engineering Manager & Data Scientist at Viafoura
Viafoura is a social monetization platform that enables brands with digital content to better engage, understand and monetize their online community. Viafoura’s easy-to-install plugins enable social login, commenting, content discovery and gamification as a way to gain deeper user insight. This data increases monetization opportunities in a variety of ways such as increased advertising revenues and better-informed content creation that keeps users engaged.
At Viafoura, I lead all technical development, architecture and data science. I am in charge of all things tech from technology acquisition, vendor acquisition, hiring, architecture, data modeling, operations, development and data science, as well as some involvement in pre-sales engineering and support.
Moving past MySQL
We evaluated many other datastores along the way. We started with MySQL, moved to Percona’s version (with Galera) and then attempted POCs on Couchbase, HP Vertica, Redis, Riak, HBase and Aerospike.
This was both a greenfield project and a migration. The part that is now live is greenfield and was used as a POC to make sure we knew our way around Cassandra at very high velocity: how to tune it, how to best provision it, etc. The upcoming parts, which will lead to us really surpassing our competition, will be part migration and part greenfield.
We will be migrating our business logic/data into Cassandra in the coming months and have plans to add many new levels of functionality to our product and business based on the way we are able to use Cassandra (vs something like MySQL or MongoDB).
Fit for Cassandra
Because we reside on some of the web’s most visited sites, we had to be able to serve several thousand requests per second, handle a write volume several times higher than that, and keep it all resilient across geographical areas.
Another reason Cassandra fits our use case is that we need very large lists of things to be returned very quickly for display. Cassandra’s (extremely) long rows made sense for this, because we can pull a ton of items relating to one query and do other work on them.
This makes it very easy for us to:
Pull out super long lists of things almost immediately.
Distribute our data across the world, so we do not have to rely on any particular cloud vendor or location.
Stop worrying about availability or partition tolerance; with eventual consistency, we know that our data will be replicated faster than we will need it.
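As a rough illustration of the long-row idea, here is a toy in-memory model in Python. It is not Cassandra's storage engine and not Viafoura's schema; the class name `WideRowStore`, its methods and the sample keys are all hypothetical. It only shows the access pattern: every item for one partition key is kept sorted by a clustering key, so a single lookup returns a long, already-ordered list.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy model of a wide row: one partition key -> many cells sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering_key, value), kept in ascending order
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        cells = self._partitions[partition_key]
        keys = [key for key, _ in cells]
        idx = bisect.bisect_left(keys, clustering_key)
        if idx < len(cells) and cells[idx][0] == clustering_key:
            # Same clustering key: overwrite, mimicking upsert semantics
            cells[idx] = (clustering_key, value)
        else:
            cells.insert(idx, (clustering_key, value))

    def latest(self, partition_key, limit):
        """Return up to `limit` values for one partition, newest (highest key) first."""
        if limit <= 0:
            return []
        cells = self._partitions.get(partition_key, [])
        return [value for _, value in reversed(cells[-limit:])]


# Hypothetical usage: comments on one article, ordered by a timestamp-like key
store = WideRowStore()
store.insert("article:1", 3, "third comment")
store.insert("article:1", 1, "first comment")
store.insert("article:1", 2, "second comment")
newest_two = store.latest("article:1", 2)
```

In this sketch the whole list lives under one key, so reading it back is one lookup plus a slice rather than a scan or join across many rows.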
We started out with 1.2.0 and have moved all the way up to 2.0.9; in production, however, we are mostly on 1.2.9+ (not quite 2.0 yet).
We manage the entire ecosystem using NetflixOSS tools: most specifically Priam to configure Cassandra, along with Eureka for node discovery, Asgard for configuration management, Hystrix/Turbine for a real-time view of our services, and others.
Currently we are in 3 datacenters and are looking to bring that to many more.
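For readers new to multi-datacenter setups, the sketch below shows how replication across datacenters is typically declared in CQL, using Cassandra's `NetworkTopologyStrategy`. The Python helper `build_create_keyspace`, the keyspace name and the datacenter names are hypothetical, chosen only for illustration; in a real cluster the datacenter names must match what the snitch reports, and the replica counts here are assumptions rather than our production values.

```python
def build_create_keyspace(keyspace, replicas_per_dc):
    """Build a CQL CREATE KEYSPACE statement with per-datacenter replica counts.

    `replicas_per_dc` maps a datacenter name to its replication factor.
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    dc_options = ", ".join(
        f"'{dc}': {count}" for dc, count in replicas_per_dc.items()
    )
    return (
        "CREATE KEYSPACE " + keyspace + " WITH replication = "
        "{'class': 'NetworkTopologyStrategy', " + dc_options + "}"
    )


# Hypothetical example: three datacenters with three replicas each
statement = build_create_keyspace(
    "demo_keyspace", {"us_east": 3, "eu_west": 3, "ap_south": 3}
)
print(statement)
```

Adding a datacenter later then becomes a matter of altering the keyspace's replication map and rebuilding data onto the new nodes, rather than redesigning the application.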
In some of our active clusters we have ingested hundreds of GB/day or more (with a very large proportion of that data constantly being TTLed away).
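"TTLed away" refers to CQL's `USING TTL` clause, which gives written data a time-to-live in seconds, after which Cassandra expires it on its own. A minimal sketch of building such a statement follows; the helper `build_ttl_insert`, the table name, the column names and the one-day TTL are all hypothetical and not taken from our schema.

```python
def build_ttl_insert(table, columns, ttl_seconds):
    """Build a parameterized CQL INSERT whose written cells expire after `ttl_seconds`."""
    if not columns:
        raise ValueError("at least one column is required")
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    column_list = ", ".join(columns)
    # One '?' bind marker per column, as used with prepared statements
    placeholders = ", ".join("?" for _ in columns)
    return (
        f"INSERT INTO {table} ({column_list}) "
        f"VALUES ({placeholders}) USING TTL {ttl_seconds}"
    )


# Hypothetical example: keep raw events for one day (86400 seconds)
statement = build_ttl_insert(
    "demo_keyspace.raw_events", ["site_id", "event_time", "payload"], 86400
)
print(statement)
```

With expiry handled at write time, high-volume short-lived data does not need a separate cleanup job, although expired cells still have to be compacted away on disk.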
It was my first foray into Cassandra in a corporate environment, though I’ve been playing around with it and testing it for several years. I like to keep abreast of new technologies, and once I heard of the promise and premise of Cassandra, I knew it was something I had to use eventually.
I definitely recommend attending your local Cassandra meetup group and, if you do not have one, logging onto Planet Cassandra to look for docs/videos.
Also, if you are brand new, be aware that simple Google searches may turn up MANY old docs that refer to pre-1.2 C* idioms, which may trip you up.
Overall, the community around Cassandra is amazing. I recommend the IRC channels, Planet Cassandra and your local meetups.