Big Data at Accenture started about two and a half years ago; that's when I joined. The main objective is to explore whether distributed technologies have a role to play in corporate enterprises, and how they can fill the gaps left by traditional technologies and databases.
That's the main purpose, and mainly we've been looking at projects involving Hadoop and Cassandra, along with some other technologies. But these are the two we have had the most chance to focus on, and most of the projects have been built around them.
Most of the conversations start off with the use case, because people have been bombarded with a plethora of distributed technologies and the first thing they have to do is select the right one.
Specifically for Cassandra, the main use case has been around ingestion. If you need to ingest data at phenomenal speed, and you want high availability, redundancy, and reliability, Cassandra shines really well. One of the projects we were working on was using a traditional relational technology, and the speed of incoming messages was just too much for the system to handle. The other complication was that the system was architected across two geographically separate data centers.
Collecting the data from the two data centers and combining it into a single source of truth was becoming a very difficult proposition. In a case like this, Cassandra's ability to ingest data at high speed, while providing a single source of truth across two data centers over a thousand miles apart, really fit the bill.
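The multi-data-center setup described above is expressed in Cassandra through the keyspace's replication strategy. As a minimal sketch, assuming the two sites are named "east" and "west" in the cluster's snitch configuration (hypothetical names), the keyspace DDL might look like this:

```python
# Sketch of a keyspace that replicates across two data centers.
# The data center names "east" and "west" and the keyspace name
# "ingest" are assumptions for illustration, not from the source.
KEYSPACE_DDL = """
CREATE KEYSPACE IF NOT EXISTS ingest
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'east': 3,
    'west': 3
};
"""

# With the DataStax Python driver, this statement would be run as
# session.execute(KEYSPACE_DDL) against a connected cluster; writes in
# either data center are then replicated to the other automatically.
```

NetworkTopologyStrategy is what lets each data center hold its own full replica set, so either site can serve reads locally while both converge to the same single source of truth.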
I think in this case specifically in the context of the new technologies, it’s a different ball game compared to something like Oracle because those technologies have been around for a long time. If somebody makes a decision to install and use Oracle or SAP, it’s a well-understood decision.
But in the new technology space, clients have so many questions, and these technologies have not yet been tried out to that extent. Most of the time we get involved right from the conception phase all the way through to implementation: from proposing that, "Okay, if you are having a problem like this, then Cassandra is the solution to try out," to doing a pilot or proof of concept demonstrating that it can achieve the speed you were shooting for, say 5,000 to 20,000 messages per second, and then showing that it can work across different data centers.
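A proof of concept like the one above usually needs a small harness that measures sustained ingest rate against a target. The sketch below is a generic throughput measurement, with a stub standing in for the actual write; in a real pilot, `write_fn` would wrap an asynchronous insert from the DataStax Python driver (an assumption here, not something the source specifies):

```python
import time

def measure_throughput(write_fn, messages):
    """Time how long write_fn takes to persist every message,
    and return the observed rate in messages per second."""
    start = time.perf_counter()
    for msg in messages:
        write_fn(msg)
    elapsed = time.perf_counter() - start
    return len(messages) / elapsed if elapsed > 0 else float("inf")

# In a real pilot, write_fn would issue the insert against Cassandra,
# e.g. via session.execute_async(...); here a list append stands in
# so the harness itself can be exercised without a cluster.
sink = []
rate = measure_throughput(sink.append, range(100_000))
```

Running the same harness against each candidate cluster configuration gives a like-for-like number to compare with the 5,000 to 20,000 messages-per-second target.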
We also get involved in the acquisition of hardware, which might involve specialized components such as SSDs, all the way up to testing and finally deploying to production. Across this entire assembly line, or chain of tasks, we've found that we have to be involved as practitioners, specifically because it's such a new technology.
One thing that I really like about Cassandra, and a reason for its successful deployments, is the elegance of the software during the implementation stage. The number of moving parts, and the way it is set up and run, is really elegant compared to other systems. The amount of configuration needed and the number of components that have to run are very minimal relative to some other technologies. That makes it a really beautiful system that is well thought out in terms of its design.
The one area that people really need to focus on is data modeling. Contrary to common belief, data modeling actually gains more importance in the case of Cassandra. One of the most important activities is to model the schema based on the sort of questions the business is interested in asking. How is the system going to be used? What sort of reports and queries would be run against it? Finalizing the schema on that basis, and putting time into that process upfront, really helps in delivering a successful implementation.
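This query-first approach can be made concrete with a small sketch. Assuming a hypothetical business question such as "show the most recent events for a given device" (my example, not from the source), the Cassandra table is shaped around that query rather than around the source entities:

```python
# Query-first modeling sketch: the table name, columns, and the
# "latest events per device" question are illustrative assumptions.
TABLE_DDL = """
CREATE TABLE IF NOT EXISTS events_by_device (
    device_id  uuid,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((device_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
"""

# The partition key (device_id) matches the query's WHERE clause, and
# the clustering order means "most recent first" comes straight off
# disk, with no sort at read time.
```

The design choice here is the point of the paragraph above: in Cassandra you pick the partition key and clustering columns from the queries you intend to run, which is why settling those questions before finalizing the schema pays off.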