Target had been using Apache Cassandra since 2014, but in 2018 they faced a new challenge: deploying an individual Cassandra cluster inside every one of their stores and running those clusters in Kubernetes. Cassandra is a database built for persistent workloads, and it had not previously been compatible with ephemeral environments. Target had to deploy Cassandra as a Docker container in Kubernetes while maintaining stability and consistency.
The first challenge Target had to overcome was the clustering component. They created a headless “peers service” registered in Kubernetes. Rather than load-balancing traffic through a single IP, the service passed lookups straight through to all of the IPs behind it. When a Cassandra container started up, it performed a simple DNS lookup on this service as part of the Docker container entry point. This allowed the nodes to discover one another and form a cluster without knowing in advance the IP addresses assigned to the Cassandra pods or how long those IPs would remain valid.
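The discovery step can be illustrated with a minimal Python sketch. It is not Target's entry point script: the service name `PEERS_SERVICE` and the helper names are hypothetical, and the sketch only assumes that a headless Kubernetes service resolves to one DNS record per pod behind it.

```python
import socket
from typing import Callable, List

# Hypothetical service name; the source does not give the real one.
PEERS_SERVICE = "cassandra-peers.default.svc.cluster.local"


def resolve_all(hostname: str) -> List[str]:
    """Return every IPv4 address that the DNS name resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr is (ip, port) for IPv4.
    return sorted({info[4][0] for info in infos})


def discover_seeds(
    hostname: str,
    own_ip: str,
    resolver: Callable[[str], List[str]] = resolve_all,
) -> List[str]:
    """List peer IPs behind the service, excluding this node's own IP.

    An empty list means no peer is registered yet (or the name does not
    resolve), so the caller has to decide how to bootstrap.
    """
    try:
        addresses = resolver(hostname)
    except socket.gaierror:
        return []
    return [ip for ip in addresses if ip != own_ip]
```

An entry point could call `discover_seeds(PEERS_SERVICE, own_ip)` and pass the result to Cassandra as its seed list. The `resolver` parameter exists only so the lookup can be replaced in tests.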
The second challenge Target faced was nodes coming up unclustered when they deployed to a store. They wrote a small init container, called the cluster negotiator, that attempted to detect other init containers of the same type and communicate what to do next. The init containers kept checking for a Cassandra database to come up and register with the peers service. When they found one, they all exited and allowed the Cassandra entry point to run, which then found a seed node and joined the cluster.
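The waiting half of that behavior might look roughly like the following Python sketch. The function name, parameters, and timeout handling are assumptions made for illustration; the source does not describe how the negotiators communicated with each other or how the first node was chosen to start, so that part is left out.

```python
import time
from typing import Callable, List


def wait_for_cluster(
    find_registered: Callable[[], List[str]],
    poll_seconds: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until at least one Cassandra node is registered.

    `find_registered` is a hypothetical callable that returns the IPs
    currently registered behind the peers service. Returns those IPs,
    or an empty list if nothing appeared within `max_attempts` polls.
    """
    for _ in range(max_attempts):
        nodes = find_registered()
        if nodes:
            # A database is up: the init container can exit and let the
            # Cassandra entry point run and join via this seed.
            return nodes
        sleep(poll_seconds)
    return []
```

Returning an empty list on timeout, rather than raising, is a design choice of this sketch: it leaves the decision about what an unclustered node should do to the caller.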
Persisting the Cassandra database was another challenge. Target mounted their data directories to static paths on a SAN that lived outside the Kubernetes cluster and mapped them into the pod. Even if the pod was destroyed, when it came back up and looked in its /data directory, the database information was just as the old pod had left it. They also split the nodes of their Cassandra cluster across the physical nodes of the Kubernetes cluster and did the same static path mapping for the data directories into the Kubernetes host volume. This limited them to a three-node Cassandra cluster but ensured that the data from the database was persisted.
To monitor the Cassandra clusters, Target attached a Telegraf sidecar with the Jolokia plugin to each deployed node. They supplied the sidecar with an environment variable identifying the location it was running in, and they streamed the metrics into the normal pipeline used for applications. With that, they built dashboards that provided a detailed view of Cassandra cluster health per location. They also set up alerts that fired when cluster or node performance fell outside healthy ranges and that identified the node and location of the problem.
Target also wrote their own small synthetic transaction monitor, packaged as a Docker image and deployed as a separate application in each location. It periodically performed a simulated purchase and wrote out the success or failure as a metric. The monitor also attempted to make a Cassandra client connection to their Cassandra service so they knew whether their app could use the database. It retrieved a full cluster status from each node, checked whether all the nodes were clustered, and sent alerts through their alerting integration if any node showed a down status, so that someone could investigate and fix the issue.
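The down-node check can be sketched in Python. This assumes the cluster status arrives as text in the layout printed by `nodetool status`, where each node row begins with a two-letter code (U or D for up or down, followed by N, L, J, or M for the state) and then the node address. The source does not say how Target's monitor actually obtained or parsed the status, and the function names here are hypothetical.

```python
from typing import Dict, List


def parse_node_states(status_text: str) -> Dict[str, str]:
    """Map each node address to its two-letter status/state code."""
    states: Dict[str, str] = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Node rows start with a code such as UN or DN; header and
        # separator lines do not match this pattern and are skipped.
        if (
            len(fields) >= 2
            and len(fields[0]) == 2
            and fields[0][0] in "UD"
            and fields[0][1] in "NLJM"
        ):
            states[fields[1]] = fields[0]
    return states


def down_nodes(status_text: str) -> List[str]:
    """Return the sorted addresses of nodes reported as down."""
    states = parse_node_states(status_text)
    return sorted(addr for addr, code in states.items() if code.startswith("D"))
```

A monitor built this way would raise an alert whenever `down_nodes(...)` returns a non-empty list, including the addresses and the store location in the alert payload.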
Target also built a web app so that their application partners could access underlying logs and Cassandra data without needing to interact with the Kubernetes CLI, know which namespace a store was in, or hold the passwords and tokens required to reach that information. The app let them find logs by store and pod, search those logs, query the Cassandra clusters in each store, and even fix some common issues, such as resetting a node with corrupt data or updating a schema in the cluster.
Through this process, Target built a platform that they knew how to deploy, could keep highly available, could verify was working, and could support and monitor across all 1,800 of their geographically dispersed stores.