Josh Glover: Software Engineer at Videoplaza
Christian Hasker: Editor at Planet Cassandra, a DataStax Community Service
TL;DR: Videoplaza is a Swedish start-up company that does video advertising and streaming video. If you go to, say, the Aftonbladet website (one of Sweden’s most popular newspapers), and you watch some of their video content and ads pop up, that’s their nefarious work.
Videoplaza uses Cassandra for their aggregate store for fast writes. They migrated the store from MongoDB, which was great for a document store and less great for a big scalable distributed system.
In production, Videoplaza is running 18 nodes on machines with 128 gigs of RAM, eight cores, and four 250 gig SSDs running in JBOD mode, which gives them about 14 terabytes of storage. Currently they have about 7 terabytes of data after eight months and plan to add more storage to their Cassandra cluster soon.
Thanks for being here for today’s Apache Cassandra use case, Josh. If you can start off by telling us a little bit about what Videoplaza does and what your role is, that would be great.
Sure. Videoplaza is a Swedish start-up company and we do video advertising and streaming video. If you go to, say, the Aftonbladet website (one of Sweden’s most popular newspapers), and you watch some of their video content and ads pop up, that’s our nefarious work. We have an adserver and we also have a user interface for configuring ad campaigns for streaming video.
Similar to YouTube, where you watch a video and it runs an ad at the beginning that counts down, “three, two, one,” before you can skip? You do that for other companies?
Yeah, in fact, as of very recently we can do that for YouTube content as well.
Oh, cool. What’s your role there?
I am what’s called a software engineer. That means I do pretty much everything. For the past year, I’ve been working on a real-time reporting system which is based on Cassandra, or rather, Cassandra is the store for the aggregate data. That’s what I’ve been focused on.
How did you come to choose Cassandra for that? Did you have experience with it before, or did you evaluate other technologies alongside Cassandra?
Yeah, exactly. We looked at quite a few things. Our old system used MongoDB, which is great for a document store and less great for a big scalable distributed system. Really, I think with Cassandra, fast writes were quite important to us as well.
It’s time series data, I would imagine, for reporting?
We’re not actually doing exactly time series data, though it has some similarities. We started out with Cassandra 1.1 using Thrift. We use the old data modeling stuff, not CQL, the new hotness. We’re transitioning to that now and also to the binary driver, which looks nice. Basically, what we have is a row key made up of aggregation points.
A reporting user might want to pull a report on advertising performance. You might want to know which ad was shown, at what time, and what geographic location the user was coming from. Those would all be dimensions. Our row key is made up of specific values of those dimensions. In my example, it could be ad 27, and we concatenate the other values. It’s like ad 27 | London | January 3rd, 2013.
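The row key described above can be sketched in a few lines of Python. This is only an illustration of concatenating dimension values: the function name `make_row_key`, the `|` separator, the `ad27` prefix format, and the ISO date format are assumptions made for the example, not Videoplaza’s actual encoding.

```python
from datetime import date

SEPARATOR = "|"  # assumed delimiter, following the "ad 27 | London | ..." example


def make_row_key(ad_id: int, city: str, day: date) -> str:
    """Concatenate specific dimension values into a single row key (hypothetical format)."""
    parts = [f"ad{ad_id}", city, day.isoformat()]
    for part in parts:
        # A separator inside a value would make the key ambiguous to split later.
        if SEPARATOR in part:
            raise ValueError(f"dimension value {part!r} contains the separator")
    return SEPARATOR.join(parts)


key = make_row_key(27, "London", date(2013, 1, 3))
print(key)  # ad27|London|2013-01-03
```

Every distinct combination of dimension values yields its own row, so a report for one ad, one location, and one day maps to a single key.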
Our column values are actually just a counter; we don’t use Cassandra counters. We just use long integers and then we also track uniques with HyperLogLog counters. HyperLogLog is a big probabilistic data structure. You can basically put in some ID and then get roughly the count of unique IDs. That’s our data model. Cassandra does fast writes and reads are pretty fast too.
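To make the HyperLogLog idea concrete, here is a minimal, self-contained Python sketch of the data structure. It is a textbook-style illustration, not the implementation Videoplaza uses: the class name, the choice of SHA-1 as the hash, and the precision `p=12` are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch: approximate count of distinct items."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of index bits (assumed value)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item: str) -> None:
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # Union of two sketches: keep the larger value in each register.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            est = self.m * math.log(self.m / zeros)
        return est


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
estimate = hll.count()
print(round(estimate))  # roughly 10000
```

Unlike plain long counters, unique counts cannot be summed across rows, because the same ID may appear in several of them. Sketches of this kind can instead be merged register by register, which is what `merge` illustrates.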
Are you still using Mongo for parts of your application, while other parts have gone to Cassandra?
Yeah, exactly. Cassandra is the aggregate store. It stores the counters and, of course, the information about what each counter represents. We use Mongo for the so-called dimension store. That’s all of the metadata about that. A dimension is, if you think of a report as a multidimensional cube, one dimension could be the ad; another dimension, the geographic location and the time. All of the client’s ads would be stored in that dimension store. For that, Mongo is really good because it’s just JSON documents. We do still use Mongo for that.
Then for the heavy lifting, when you need scale, you use Cassandra.
Okay, great. If you could tell us a little bit about your environment, that would be great.
Yeah. In production, we’re running 18 nodes on machines with 128 gigs of RAM, eight cores, and we have four 250 gig SSDs running in JBOD mode which gives us about 14 terabytes of storage on the 18-node cluster.
How much data are you running?
We’re at the 7 terabytes range. That’s about eight months of data. Obviously, we’re going to have to add more storage to the cluster pretty soon. We actually started out with 1.1. We’re running static tokens. We don’t have vnodes on our production cluster yet, but we’re trying to transition to vnodes because, obviously, that makes it easier to add capacity.
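As a back-of-the-envelope check on “pretty soon,” the figures quoted here (about 14 terabytes of capacity, about 7 terabytes used after eight months) can be turned into a rough projection. The sketch below assumes linear growth, which the interview does not state, and the `usable_fraction` parameter is a hypothetical knob for reserving free disk space for operations such as compaction.

```python
def months_until_full(capacity_tb: float, used_tb: float, months_elapsed: float,
                      usable_fraction: float = 1.0) -> float:
    """Estimate months left before the cluster fills, assuming linear growth."""
    growth_per_month = used_tb / months_elapsed      # TB written per month so far
    usable_tb = capacity_tb * usable_fraction        # capacity after reserving headroom
    return max(0.0, (usable_tb - used_tb) / growth_per_month)


# Figures from the interview: ~14 TB capacity, ~7 TB used after 8 months.
no_headroom = months_until_full(14.0, 7.0, 8.0)
with_headroom = months_until_full(14.0, 7.0, 8.0, usable_fraction=0.75)
print(no_headroom)    # 8.0
print(with_headroom)  # 4.0
```

With no headroom reserved the cluster would fill in about eight more months at the same rate; keeping a quarter of the disks free (an assumed figure) halves that.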
You’re going to go to 1.2 or skip over that and go to 2.0?
No. We’re on 1.2 now in production. We upgraded to that, and then we have a secondary cluster that we’ve built with vnodes. We’re probably going to transition the data to that, because that seems to be the easiest way. We were trying to switch to vnodes on the production cluster and then shuffle, but with our write load, it was just going to take too long to actually shuffle the vnodes. We started assessing that, realized it was not the easiest thing to do, and backed off.
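A small sketch can show why static tokens make adding capacity awkward. It assumes the cluster uses RandomPartitioner (the default when it was started on 1.1; the interview does not say), whose token range is 0 to 2^127 - 1, and it uses the conventional evenly spaced assignment `i * 2**127 // N`. The function name `static_tokens` is made up for the example.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed partitioner)


def static_tokens(node_count: int) -> list[int]:
    """Evenly spaced initial tokens for a ring of node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


old_tokens = set(static_tokens(18))
kept_19 = len(old_tokens & set(static_tokens(19)))  # add a single node
kept_36 = len(old_tokens & set(static_tokens(36)))  # double the cluster
print(kept_19)  # 1
print(kept_36)  # 18
```

Going from 18 to 19 evenly spaced nodes leaves only the token at position 0 unchanged, so keeping the ring balanced means moving almost every node. Doubling to 36 keeps all 18 existing tokens but requires buying twice the hardware. With vnodes, each node owns many small ranges, so a new node can take a share from all existing nodes without manual token arithmetic.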
You’re in a single data center for the moment; do you have plans to go to multiple?
Probably. Obviously, as our operations grow we are just going to outgrow one data center. For now, we’re just in one.
Cool. The data model is something the community is really interested in, and you’ve covered yours, so that’s great. What advice would you pass along to others who are either looking at Cassandra or new to it? Is there something you’ve learned along the way that you think would be worth passing along?
The main thing is to get as much data into Cassandra as possible, as quickly as possible, because you won’t really be able to validate your data model and the performance until you have a lot of data. Obviously, with any production system, you want to have it running in stealth mode as early as possible. Cassandra is relatively easy to operate, but like any big distributed system, there are going to be surprises. Just put a production load on it as soon as possible, before you actually let people read data from it.
So you would recommend real-world scenarios. Don’t just do some tire kicking in development and think it’s going to work like that in production; actually get production load on it.
Yeah, as soon as possible. This is probably obvious, but you’re going to want a secondary cluster for doing testing and development.
Do you replicate the production data into your dev?
Not directly, but we have a way to aim production data at that cluster as well, and to turn that on and off as needed. That’s a good way to give it a realistic load, but no, we don’t replicate into it. Cassandra, like I said, is fairly easy to operate.
Sometimes deceptively so: certain things you might want to do appear to be easy, but you should still definitely run them in test first.
Trust but verify.
Okay, good advice. You’re here at the Cassandra Summit. Any sessions you’re particularly looking forward to?
Yeah. The Cassandra Internals one tomorrow looks really exciting.
Aaron Morton, if you haven’t seen him, I think he’s a great speaker.
Well, thank you very much. I think we hit everything, is there anything else you’d like to add?
No, other than we’re really happy with Cassandra and also the community around Cassandra is great. We’ve had an opportunity to work with DataStax a bit as well. It’s great, the services you guys are providing both just to the community and also on a commercial basis.
Oh, thank you. What’s the community scene in Stockholm like? Is it starting to grow?
Yeah, there are lots of start-ups doing exciting stuff in Stockholm. Spotify is the biggest obvious example, and they’re heavy users of Cassandra. We hosted a meet-up there about a month ago and about 60 people showed up, so there’s definitely a lot of interest in Cassandra.
We’re rolling out a DataStax Enterprise for Startups program, so if there’s ever any interest in running DataStax Enterprise, let us know. Matt Pfeil, the founder of DataStax, is running the program: if you meet the start-up status requirements, you get DataStax Enterprise for nothing.
Well, thank you very much. This has been great.
Yes, thank you.