Chris Raynor: Software Engineer at Jetpac
Matt Pfeil: Co-Founder at DataStax
Hi Planet Cassandra users, this is Matt Pfeil from DataStax. Today I’m joined by Chris Raynor, an Engineer from Jetpac. Chris, thanks for joining us today; why don’t you start off and help all of our listeners and readers understand a little bit more about what Jetpac does.
Jetpac is basically big data for images. We have processed billions of images from social media, using image processing and metadata analysis, to get better insights into venues and locations for recommendations. We’re going to be releasing a new product soon; for now, that’s pretty much all I can say.
That’s awesome. How do you guys use Cassandra?
We store a lot of image metadata that we’ve collected through Facebook, Instagram, Twitter, Foursquare and many other services. We index and store the data in a plethora of different ways. We do this in order to feed that data back and learn from the insights that it provides, giving us an idea of what locations near you would be interesting to visit.
Awesome. Is your data model something like one row for each image’s worth of metadata, or can you help me understand that a little bit more?
We might pull in a specific person’s Facebook photos, for example. We initially index them by the person we originally got the images from. Then we would go through and determine where those images were taken. We would perhaps store all of the images from, let’s say, San Francisco in a row, in a different column family, so that it is easily accessible. Or we might do some image analysis and detect where all the venues are that certain people frequent. Where hipsters hang out, for example, from the number of people with mustaches, putting all the image metadata that we need, including URL and various other metrics about that venue, in a row, in another column family or with a different composite key. We index by a whole load of different methods based on how we need to retrieve it from the front end later on.
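As a rough illustration of the indexing pattern Chris describes, the Python sketch below writes the same image metadata under several row keys, so that each kind of lookup reads a single row. Plain dictionaries stand in for column families here; the names `images_by_owner`, `images_by_city`, `images_by_venue` and `index_image`, along with the sample records, are hypothetical and are not Jetpac's actual schema.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for Cassandra column families:
# each maps a row key to a dict of {image id: metadata}.
images_by_owner = defaultdict(dict)
images_by_city = defaultdict(dict)
images_by_venue = defaultdict(dict)


def index_image(image):
    """Write the same metadata under every key we expect to query by."""
    images_by_owner[image["owner"]][image["id"]] = image
    images_by_city[image["city"]][image["id"]] = image
    # Composite row key: (city, venue) keeps each venue's images together.
    images_by_venue[(image["city"], image["venue"])][image["id"]] = image


# Made-up sample records, for illustration only.
index_image({"id": "img1", "owner": "alice", "city": "San Francisco",
             "venue": "Cafe A", "url": "http://example.com/img1.jpg"})
index_image({"id": "img2", "owner": "bob", "city": "San Francisco",
             "venue": "Cafe A", "url": "http://example.com/img2.jpg"})

# Reading everything for a city is one row lookup rather than a scan.
san_francisco = images_by_city["San Francisco"]
```

The trade-off in this style of modeling is extra writes and storage in exchange for cheap reads, which is consistent with the same image being indexed many different ways.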
It sounds like you’re storing a lot of metadata about a lot of different images. Can you share anything about either the number of images or how much data you guys are storing?
We have passed the two billion mark in number of raw images that we’ve pulled in, and we index that at least 10 different ways. We probably store about 20 billion images’ worth of metadata. We do categorize the images, so I estimate we have tens of billions of rows. We also download the images and do a lot of image processing on them, separately.
Where’s your infrastructure located? Are you running out of your own data centers or in the cloud?
We’re all EC2 right now; they’re all Ubuntu boxes, fairly vanilla. We have DataStax Community running for monitoring the service. We have 8 nodes in one availability zone and 4 in another for large-scale analysis.
That’s awesome. What’s the number one feature in Cassandra that you find is best for your use case?
The scalability is huge for us, and having every node be equal in a flat hierarchy is great; it means we can just add nodes whenever. We haven’t yet moved up to 1.2 but I’m excited about virtual nodes. Also, the ability to easily handle multiple data centers so that we can do analysis using Hadoop on the same data, without affecting the production cluster, has been very useful for us.
Chris, thank you very much for the information. Is there anything else you’d like to share with Planet Cassandra today?
We’re constantly blown away by Cassandra and the community; we’ve had some great training through DataStax and it’s really worked for us.