PagerDuty is the central hub for on-call and operations dispatch. At its core, it ties all your monitoring services together in one place, manages your on-call schedules, escalation policies, and notification methods, and ensures that when something goes wrong in your service, the right person is alerted and can act quickly to resolve the issue. I personally work on the alert pipeline, which starts with our monitoring integrations or our HTTP and email APIs and ends with a person getting a call, SMS, email, or push notification.
Our business is to alert people when they’re having problems, so we need a high standard for the uptime of our integrations and the deliverability of our alerts; there’s little value in an alert service that only works sometimes. That means we need to tolerate even catastrophic regional failures. Cassandra’s tunable replication and consistency let us define and implement these policies and be fault-tolerant in precisely the ways we need to be.
Cassandra is essentially the platform that we built our alert pipeline on. This pipeline is broken into multiple stages and services, but each is backed by Cassandra so that we can build on the reliability it provides.
Stability is absolutely the top benefit we receive from Cassandra. It’s hard to build a stable service if the bottom of the stack isn’t stable. Cassandra functions as a solid base for our applications.
We evaluated Cassandra 3-4 years ago and found it to be the most mature option and the most suitable for cross-DC deployments. We’re now running several Cassandra 1.2 clusters, each with nodes in 3 data center regions. Each cluster has 5-10 nodes and a replication factor (RF) of 2-2-1 across the three regions.
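To make the 2-2-1 layout concrete, here is a minimal sketch of the quorum arithmetic behind it. It assumes reads and writes use Cassandra's QUORUM consistency level, which the text above does not state, and the region names are hypothetical placeholders. QUORUM needs a strict majority of all replicas, so with 5 replicas it needs 3, and the sketch checks whether that many remain after losing any single region.

```python
# Hypothetical region names; the source only gives the replica counts 2-2-1.
REPLICAS_PER_REGION = {"region_a": 2, "region_b": 2, "region_c": 1}


def quorum(total_replicas: int) -> int:
    """Size of a Cassandra QUORUM: a strict majority of all replicas."""
    return total_replicas // 2 + 1


def survives_region_loss(replicas: dict[str, int]) -> dict[str, bool]:
    """For each region, report whether a quorum is still reachable without it."""
    total = sum(replicas.values())
    needed = quorum(total)
    return {region: total - count >= needed for region, count in replicas.items()}


if __name__ == "__main__":
    total = sum(REPLICAS_PER_REGION.values())
    print(f"total replicas: {total}, quorum: {quorum(total)}")
    for region, ok in survives_region_loss(REPLICAS_PER_REGION).items():
        print(f"lose {region}: quorum still reachable = {ok}")
```

Under that assumption, losing any one of the three regions still leaves at least 3 of the 5 replicas, which is one way a 2-2-1 split can ride out a full regional outage.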
Background anti-entropy and load issues can creep up on you without warning if you’re not watching for the right signs. Stay ahead of your depleting capacity by scaling in advance, a task that’s relatively easy to do with Cassandra.
I went to the Cassandra Summit in San Francisco this year and was really impressed by many of the speakers there. A lot of them were very candid about their experiences and lessons learned, which was really useful to me.