Chaordic works with personalization, providing recommendations for e-commerce services. The biggest e-commerce sites in Brazil use our systems to serve up personalized buying recommendations for their users. That is basically what we do, and my role is Lead Engineer on the Data Platform team.
For the data platform we made a complete transition from MySQL to Cassandra. That was by the end of 2011, when we were starting to land big clients here in Brazil. Today we have nine of the 15 biggest e-commerce companies as customers.
We started growing a lot in 2011 and 2012, and MySQL became difficult to scale because all our recommendations are based on the events users generate as they interact with a site: what they click, which deals they look at, what they buy. All of this information needs to be stored for us to give good recommendations. It was getting very hard to scale on MySQL, and that was the main motivation for migrating to Cassandra. We use several data technologies in the company, but with this migration we moved all of our application data from MySQL to Cassandra.
We also did some small benchmarks against MongoDB and HBase, but ease of scaling and write performance were the main reasons we chose Cassandra. Low write latency, a result of Cassandra's architecture, was key for us, since we need to log a huge volume of events to generate recommendations.
During the migration, we developed acceptance tests and unit tests to guarantee that everything would keep working as it had with the SQL solution. Having automated tests across the application stack was essential to guaranteeing the migration would work. We also had to learn a lot about how to operate and maintain a Cassandra cluster and make sure the data stayed consistent.
Now, with Cassandra, all our solutions run on Amazon Web Services and other cloud technologies. We have two data centers: one serves online applications, where we plug in our OnSite solution; the other handles batch processing and data analytics with Hadoop and Hive. There are 24 nodes in each data center, and this separation lets us run different kinds of applications in each one. Most of the calculations for generating recommendations are done in batch, and we don't want those calculations to interfere with the operation of our other products, so we run them in the secondary data center to avoid read latency problems.
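The two-data-center split above is a standard Cassandra pattern. A minimal sketch of how such a keyspace could be defined, assuming illustrative data center names ("online" and "analytics") rather than our actual configuration:

```sql
-- Sketch only: keyspace and DC names are assumptions, not our real schema.
-- NetworkTopologyStrategy replicates the same data into both data centers,
-- so batch jobs in "analytics" read local replicas and never compete with
-- online traffic served from "online".
CREATE KEYSPACE events
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'online': 3,
    'analytics': 3
  };
```

With this layout, online clients typically read and write at consistency level LOCAL_QUORUM against their own data center, which is what keeps analytics load from affecting online read latency.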
We run Hadoop MapReduce on top of Cassandra for the most general use cases, like data archiving, dumps, and simple analytics. For other applications we dump the data out of Cassandra: we use Elastic MapReduce on Amazon's infrastructure, which gives us a lot of flexibility for our recommendation algorithms; likewise, for ad-hoc queries we upload data to HDFS and query it with Hive.
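To give a feel for the ad-hoc Hive path, here is a hedged sketch (the table layout, column names, and HDFS path are illustrative assumptions, not our actual schema): once an event dump lands in HDFS, an external table makes it queryable in HiveQL.

```sql
-- Illustrative only: schema and location are assumptions.
CREATE EXTERNAL TABLE events (
  user_id    STRING,
  product_id STRING,
  event_type STRING,
  ts         BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/events/';

-- Example ad-hoc question: clicks per product.
SELECT product_id, COUNT(*) AS clicks
FROM events
WHERE event_type = 'click'
GROUP BY product_id;
```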
Since our initial launch we have been migrating our legacy data model from Thrift to CQL, which has dramatically improved developer productivity and given our application some relevant performance gains. We have also been experimenting with SSD-based hardware on a subset of our data model and have seen great performance improvements so far. Definitely, CQL plus SSDs is a sweet spot for Cassandra, and that is where we will be heading in the next few months.
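To illustrate the kind of modeling the Thrift-to-CQL move enables (table and column names here are hypothetical, not our real schema): what was an opaque wide row under Thrift becomes a typed, clustered table in CQL, and queries become readable.

```sql
-- Sketch only: names are illustrative assumptions.
-- One partition per user; events clustered newest-first within it.
CREATE TABLE user_events (
  user_id    text,
  ts         timeuuid,
  event_type text,
  product_id text,
  PRIMARY KEY (user_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Latest 20 events for a user, served from a single partition:
SELECT * FROM user_events WHERE user_id = 'u1' LIMIT 20;
```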