Christian Hasker: Editor at Planet Cassandra, A DataStax Community Service
TL;DR: Parse started out as a platform that focused on providing tools for a lot of mobile developers to focus on building the apps, rather than worrying about databases and handling servers. The things Parse offers: a data store, an API for sending push notifications to the various platforms, wrappers for common social APIs, web hosting, file-hosting, custom server code and hooks, and now they also offer analytics products.
Parse researched a number of different databases and Cassandra continued to surface as an option that the community was very happy with and it seemed particularly suited for their use cases.
Parse typically sees a number of game companies that use their services for Facebook games, as well as a lot of photo-sharing apps, which may reflect the app economy as a whole. Parse is currently using 12 nodes, somewhere on the order of hundreds of GBs, and they expect their total usage to grow in the near future.
I’m joined today by Christine Yen of the Parse division at Facebook. Christine is a Software Engineer at Parse; Christine, why don’t you start off by telling us a little bit about what Parse does within Facebook and what your role is there.
Parse started out as a platform that focused on providing tools for a lot of mobile developers to focus on building the apps, rather than worrying about databases and handling servers. Since that initial goal, we’ve expanded to multiple platforms and we also now support some web products. To list some of the things we offer: a data store, an API for sending push notifications to the various platforms, wrappers for common social APIs, web hosting, file-hosting, custom server code and hooks, and now we also offer our analytics products.
Great, and your role there, Christine? How are you involved in some of those initiatives that you just outlined?
Sure. I’m a Software Engineer, focused on analytics.
How are you using Cassandra within Parse? One of the things that intrigued me, when you started out, was that you said you don’t want developers to have to worry about the database layer.
We used MongoDB, which works really well for developers who want to store their data in flexible schemas and not have to worry about migrations, but it is not as good for the load we anticipate needing to handle to provide a solid analytics product: namely, high write capacity and high availability, able to capture as much data as we’re going to be throwing at it.
We researched a number of different databases and Cassandra continued to surface as an option that the community was very happy with and it seemed particularly suited for our use cases.
If you could walk us through your architecture, that would be great.
Sure. Our architecture actually has several other components, not just the MongoDB and Cassandra components. We also use Redis to help store some of our push notification information and generic queues. From working with both Mongo and Redis, we are very familiar with the trials and tribulations of keeping clusters up. We knew that we wanted something that would be able to handle nodes arbitrarily going down and the ability to grow the cluster flexibly.
Specifically, when requests come in, we have a bunch of app servers which handle most API requests. Depending on what part of the system they’re hitting (whether they’re trying to access their app’s data, write something to analytics, or queue something to be handled in the background), the request gets routed to the appropriate data store, and the response comes back.
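As a rough illustration of the routing described above, here is a minimal Python sketch. The pairing of request kinds with stores follows the interview (app data in MongoDB, analytics in Cassandra, queues in Redis), but every function name and route key is hypothetical, and the handlers only return the name of the store they would talk to.

```python
from typing import Callable, Dict


def handle_app_data(payload: dict) -> str:
    # Hypothetical handler: the interview says app data lives in MongoDB.
    return "mongodb"


def handle_analytics(payload: dict) -> str:
    # Hypothetical handler: analytics writes go to Cassandra.
    return "cassandra"


def handle_queue(payload: dict) -> str:
    # Hypothetical handler: queued background work goes through Redis.
    return "redis"


# Route keys are assumptions made for this sketch, not Parse's real API.
ROUTES: Dict[str, Callable[[dict], str]] = {
    "app_data": handle_app_data,
    "analytics": handle_analytics,
    "queue": handle_queue,
}


def route(kind: str, payload: dict) -> str:
    """Send a request to the handler for its kind and return the result."""
    handler = ROUTES.get(kind)
    if handler is None:
        raise ValueError(f"unknown request kind: {kind!r}")
    return handler(payload)
```

Calling `route("analytics", {"event": "open"})` in this sketch returns `"cassandra"`; an unrecognized kind raises `ValueError`.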
From a developer perspective working with your platform, do I not need to worry at all about the data model or the schema I’m working with? Can I just get coding, and Parse handles, under the covers, where data gets stored?
Yes. The only thing you need to worry about is that once you define a column as a particular data type, if you start storing integers under keys, you won’t be able to suddenly start storing strings; but otherwise, when you first start playing with a Parse object, it’s intended to just feel like a dictionary. You can just put objects in and they get saved to our backend.
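To make the type rule concrete, here is a small Python sketch of a dictionary-like object whose keys keep the type of the first value stored under them. The class name `TypeLockedObject` is hypothetical and this is not the Parse SDK; it only mimics the behavior described above.

```python
class TypeLockedObject:
    """Dictionary-like object; each key is locked to its first value's type."""

    def __init__(self):
        self._values = {}
        self._types = {}

    def __setitem__(self, key, value):
        # The first write records the type; later writes must match it exactly.
        locked = self._types.setdefault(key, type(value))
        if type(value) is not locked:
            raise TypeError(
                f"{key!r} holds {locked.__name__}, got {type(value).__name__}"
            )
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]
```

With this sketch, `obj["score"] = 10` followed by `obj["score"] = 11` succeeds, while a later `obj["score"] = "high"` raises `TypeError`.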
Again, to make it even easier for developers, we provide save-in-background functionality. We also provide a “save eventually” option which allows the request to be serialized in case the device doesn’t have connectivity at the time, and sent to the server when the device is able to connect to the internet again.
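The “save eventually” idea can be sketched as a queue that serializes each request and replays it once connectivity returns. This is an assumption-laden Python illustration, not Parse’s implementation: the class name, the `online` flag, and the `send` callable are all hypothetical.

```python
import json
from collections import deque


class SaveEventuallyQueue:
    """Holds serialized saves while offline and replays them in order."""

    def __init__(self, send):
        self._send = send        # hypothetical callable that delivers one dict
        self._pending = deque()  # JSON strings waiting to be sent
        self.online = False

    def save_eventually(self, obj: dict) -> None:
        # Serialize right away so later changes to obj do not affect the save.
        self._pending.append(json.dumps(obj))
        if self.online:
            self.flush()

    def flush(self) -> int:
        """Send every pending save; return how many were delivered."""
        sent = 0
        while self._pending:
            # Dequeue only after send returns, so a failed send stays queued.
            self._send(json.loads(self._pending[0]))
            self._pending.popleft()
            sent += 1
        return sent
```

A caller would set `online = True` and call `flush()` when the device regains its connection; a real client would also need to persist the queue across app restarts, which this sketch leaves out.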
As far as the business goals for Facebook, you want me as a developer to build on your platform, I take it, and that’s what Parse’s reason for being sounds like it is. What kinds of applications are you seeing that people are building on top of your platform?
We see a lot of game companies that use Parse for Facebook games. We see a lot of photo-sharing apps actually, which may just be an indication of the overall app economy as a whole.
I think my favorite example is that Sesame Street recently built a Cookie Monster app and an Elmo app. They’re a great example of a company where, you can imagine, they don’t want to spend time and engineers worrying about servers and file storage and billing. They want to devote their software engineers to building the best mobile experience they can for their users, and we can handle all the backend for them to have it be a seamless experience.
Earlier when you talked about needing a solution under the covers that scales up and scales out, do you know what your environment looks like? Are you running in different availability zones? Are you running on spinning disks or on solid-state disks? Anything like that is pretty useful to the community as well.
We’re using ephemeral storage on AWS, and are across three availability zones.
As far as the volume of data that Cassandra is handling for Parse, can you give us some insight into how many clusters you have and how much data each one is storing?
Because we’re only storing basic data for analytics, everything is counts and counters right now. It means our data is actually much more dense than someone who’s storing more complex information.
We’re currently using 12 nodes, and I believe somewhere on the order of hundreds of GBs, so nothing unreasonable. Our use right now is not huge, but we also have been slowly ramping up our analytics product and recently released custom analytics, which we expect will grow over the next several months as users start taking advantage of being able to store free-form analytics. We expect that our total usage will grow in the near future.
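For readers unfamiliar with counter-style analytics, the following Python sketch shows the general shape of “counts and counters” data: one integer per app, event, and time bucket. It uses an in-memory `Counter` rather than Cassandra, and the hourly bucketing and all names are assumptions made for illustration, not a description of Parse’s schema.

```python
from collections import Counter
from datetime import datetime


class HourlyCounters:
    """In-memory stand-in for counters keyed by (app, event, hour)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, app_id: str, event: str, when: datetime) -> None:
        # Truncate the timestamp to the hour so events share a bucket.
        bucket = when.strftime("%Y-%m-%dT%H")
        self._counts[(app_id, event, bucket)] += 1

    def count(self, app_id: str, event: str, bucket: str) -> int:
        # Counter returns 0 for buckets that were never written.
        return self._counts[(app_id, event, bucket)]
```

Because each bucket stores a single number no matter how many events arrive, this kind of data stays compact, which matches the point above about density.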
It would be great to come back and maybe do this again in five or six months, and get an update on how that user adoption has been and how your data growth has been as well.
Yeah, I think it will be. Hopefully we’ll have some good numbers.
By the way, which version of Cassandra are you running right now?
I believe 1.1.8 … I think 1.2 came out right after we felt like our system was stable enough to start using Cassandra internally, and we didn’t feel like we were ready quite yet to move to 1.2.
Is that in the plans, or are you going to skip that and just go to 2?
I know there are a number of things in 1.2 that are tempting. I think, at this point, we will explore that as our needs also grow. Right now I think 1.1.8 is solid enough to make sure it can handle all the growth over the next few months, and we have enough on our plate with the move to Facebook.
I bet it’s a crazy time over there. Definitely with 2.0, the driver support is just getting better and better, so, as you mentioned earlier, you may see benefits from some of the DataStax drivers as you migrate. Thank you so much for taking the time to talk with us today.
Cool. One more thing: I have to thank you guys. DataStax documentation was immensely helpful, especially in just becoming familiar with the abilities and limitations of Cassandra, so I wanted to thank you guys for that also.