Fanzo listens to what fans are saying about their teams in social media, scores and ranks it, then presents to fans the hottest, most engaged stories – be it news links, Tweets, Instagrams or Facebook posts.
During the Super Bowl there are twenty-five million tweets during the five-hour window, and most of the stuff is really not that interesting. What we do is find the interesting stuff and then present it to users in real time.
Basically, whatever people are talking about, we pick it up, and the more people are talking about it, the higher it scores; then we show the top stuff. There’s a threshold that something has to meet before it shows up in what we call a channel. You can go into Fanzo and pick your favorite teams, and then we organize all the content into those team channels, so it’s a lot easier to use than Twitter or Facebook or any of those others because it brings all the information, and just the best stuff, into one spot. It’s a great way to follow your team and get up to date in just a couple of minutes.
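The scoring idea described above can be sketched roughly as follows. This is only an illustration, not Fanzo's actual algorithm: the function name `build_channels`, the threshold value, and the "one point per mention" rule are all assumptions, since the interview gives no formula.

```python
from collections import defaultdict

# Hypothetical threshold; the interview does not state the real value.
CHANNEL_THRESHOLD = 3


def build_channels(mentions, threshold=CHANNEL_THRESHOLD, top_n=2):
    """Rank items per team channel.

    mentions: iterable of (team, item_id) pairs, one pair per post
    that talks about that item (assumed input shape).
    """
    # The more people talk about an item, the higher it scores.
    scores = defaultdict(int)
    for team, item_id in mentions:
        scores[(team, item_id)] += 1

    # Only items that meet the threshold appear in a channel.
    channels = defaultdict(list)
    for (team, item_id), score in scores.items():
        if score >= threshold:
            channels[team].append((item_id, score))

    # Highest score first; ties broken by item_id for stable output.
    return {
        team: sorted(items, key=lambda pair: (-pair[1], pair[0]))[:top_n]
        for team, items in channels.items()
    }
```

Calling `build_channels` with a list of `(team, item_id)` pairs returns a dictionary mapping each team to its top-scoring items; anything below the threshold never shows up.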
I’m the CEO/CTO of Fanzo and built the system. I do development, business, and ops, because we’re just a small little startup and when we were building it there were three of us.
In order to listen to what people are saying we need to hook up to some seriously large data feeds that move really quickly, because we’re trying to do all this in real time. Cassandra sits as part of our scoring system: as social media posts come in, we do some calculations and write the results quickly to Cassandra, so what we keep there is the scoring information, not the content itself.
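As a rough sketch of that write path, the snippet below keeps only numeric score data per item and never stores the post text. It uses an in-memory dictionary as a stand-in for the database; the class names, the post fields (`team`, `item_id`, `timestamp`, `text`), and the row layout are hypothetical and not taken from Fanzo's schema.

```python
from dataclasses import dataclass


@dataclass
class ScoreRow:
    """Numeric score data only; no post content (assumed layout)."""
    mentions: int = 0
    last_seen: int = 0


class ScoreStore:
    """In-memory stand-in for the score table (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def record(self, post):
        # Key the row by team and item; the post's text is never stored.
        key = (post["team"], post["item_id"])
        row = self.rows.setdefault(key, ScoreRow())
        row.mentions += 1
        row.last_seen = max(row.last_seen, post["timestamp"])
        return row
```

Each incoming post turns into one small update to a score row, which is the kind of write-heavy pattern the interview describes.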
I conceived of the scoring system when we were going through the Techstars Program. We were moving fast and had a challenge of, “okay go hook up to the Twitter stream API and go score us some tweets,” and we were trying to figure out a place to store the data. Because I know Twitter can spike, I needed something that could handle a Super Bowl-like spike. I wanted something that could write really quickly, and one of the other guys in the accelerator had some experience with Cassandra and suggested I check it out, and it looked great. I looked at a few other NoSQL databases and ended up choosing Cassandra for a couple of reasons.
One of them was obviously the write speed that it can handle, but another was its ability to handle node failures. Back in December of 2012 Azure wasn’t necessarily the most stable place in the world, especially for Linux VMs, and so they would periodically disappear; I needed something that could handle that when they restarted themselves and came back online. The combination of the cluster aspect, the ability to scale horizontally as we added more data feeds, and its sheer ability to write fast were the reasons why we chose it.
The first Super Bowl we handled was January of 2013 and we just set a scale there. We upgraded from 1.1 to 1.2 a little over a year ago and during that process I increased the node size a little bit, because I was starting to store more information during that time.
We have a 5-node cluster that we’ve upgraded and grown a couple of times. It basically holds the content for sports in social media. It’s a lot of writes, which is again why we chose Cassandra, because it can handle a huge write load.
We went with five nodes because that gave me the redundancy I needed: based on the replication factor, I could lose two servers and still be up and working in a solid way. I also needed the increased amount of disk, and the way that Cassandra scales horizontally, you can just keep adding nodes and that actually increases the size of the data you can manage. Since I was storing more stuff, we needed more nodes, but I didn’t want too many because we are a cash-strapped startup.
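The arithmetic behind "lose two servers and still be up" depends on the replication factor and the consistency level, neither of which is stated in the interview. The sketch below uses the standard Cassandra quorum rule (more than half of the replicas) and shows two assumed configurations that would each tolerate two failed replicas; the helper names are illustrative.

```python
def quorum(replication_factor):
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor, required_replicas):
    """Replicas of a given row that can be down while requests still succeed."""
    return replication_factor - required_replicas


# Assumed configuration A: RF = 3 with consistency level ONE (1 replica needed).
rf3_one = tolerated_replica_failures(3, 1)

# Assumed configuration B: RF = 5 with consistency level QUORUM.
rf5_quorum = tolerated_replica_failures(5, quorum(5))

# For contrast: RF = 3 with QUORUM survives only one failed replica.
rf3_quorum = tolerated_replica_failures(3, quorum(3))
```

Under either of the first two assumptions, two of the five servers can disappear without taking reads or writes offline for any row.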
We kind of picked the number that we felt would make a good balance between redundancy, uptime and data management. It’s a virtualized environment, so you are basically limited by IOPS, I/O operations per second, and I needed to spread that across nodes.
We’re pretty confident with our scale; we’ve now been through a number of Super Bowls and it has handled the load no problem. The World Cup was bigger than the Super Bowl even. Maybe not at peak, but you know it was a month long, so it was pretty insane.
One of the nicest things about it is I haven’t really had to think about it much recently. It’s just there and it works. We’re still on 1.2, and I just haven’t had to touch it. That’s what you want a database to be. You don’t want to have to think about it.
The fact that I was able to figure it out, deploy it, build the system, and have 100,000 Twitter accounts ranked within a week is a real testament to how easy it is to get it deployed, up and running, and useful.
When we went to 1.2, we tried using some of the newer CQL stuff and ran into some challenges with that, although it turned out to be the data model that we were trying to shove down Cassandra’s throat. Actually, that’s where the community really came into play because I was running into some serious problems where the system would run for like a week and then it would hit a wall and I was pulling my hair out trying to figure out why.
I was asking a bunch of questions on the Cassandra users mailing list, trying to get help and figure out what was going on, because I was using some of the new stuff and they were still finding bugs around things like the new arrays, and I was using some of that. They would respond and I would grab patches and rebuild my own version of Cassandra.
That’s the beauty of open source, you can just build the code and throw stuff in. Eventually I worked out what the problem was and tweaked the data model and all was well, but it would have been a lot harder without the help of the community.