I’m CTO & co-founder at adsquare, Europe’s leading data platform for mobile programmatic advertising.
Our platform supercharges data-driven targeting. With our solution, advertisers and agencies can leverage data to reach their desired audiences and meet campaign goals. On the other side, publishers and third-party providers can on-board and monetize their data.
I’m responsible for platform development and lead the technology and data science departments. I’m a born-and-bred Berliner and my area of expertise is scalable backend architectures and big data.
At adsquare we’re using DataStax Community 2.1. We use Apache Cassandra in two roles: as storage for the aggregates and calculations produced by our real-time event processing system (Spark Streaming), which various backend systems then query, and as the central data store for our tile-based infrastructure (billions of data points).
We use Spark Streaming with Cassandra to post-process every single bid request in near-real time in the most scalable way. Events from all our data centers around the world are mirrored via Kafka, which by design also works nicely as distributed storage.
As scaling is important for us, we needed a fast and solid solution that can post-process data by filtering, enriching, aggregating and storing it. It was important for us to have horizontally scalable components in every piece of our architecture, so Spark Streaming was the perfect match alongside Kafka and Cassandra. The main advantages of Spark Streaming are that it’s easy to develop and maintain, it’s blazing fast compared to batch processing in Map/Reduce and, most importantly, it’s fun.
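The filter, enrich and aggregate steps can be illustrated with a minimal sketch in plain Python. This is not our Spark Streaming job: the event fields (`lat`, `lon`, `app`), the `process_batch` function and the one-degree tile size are all assumptions made up for the example, and the storing step is left out.

```python
import math
from collections import Counter

def process_batch(events, tile_deg=1.0):
    """Filter, enrich and aggregate one micro-batch of events.

    `events` is a list of dicts with hypothetical keys "lat", "lon" and "app".
    Returns a Counter mapping (tile, app) to the number of events seen.
    """
    counts = Counter()
    for event in events:
        # Filter: skip events without a usable location.
        if event.get("lat") is None or event.get("lon") is None:
            continue
        # Enrich: derive a coarse tile id from the coordinates.
        tile = (math.floor(event["lat"] / tile_deg),
                math.floor(event["lon"] / tile_deg))
        # Aggregate: count events per (tile, app).
        counts[(tile, event["app"])] += 1
    return counts
```

In a streaming setup, a function of this shape would run once per micro-batch and the resulting counters would be written to the aggregate store.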
We evaluated Cassandra against Couchbase and MongoDB, which was already deployed in our system. To name a few of the motivating factors, Cassandra gives us excellent write performance, a more flexible query language than many document-based databases, linear scaling to pretty much any size and performance requirement, support for multiple data centers, relatively low latency, no single point of failure and excellent Spark integration!
Our deployment currently resides in one data center, with 2 terabytes of data spread across 6 nodes. We’re just starting to use Cassandra and are replacing more and more existing databases and pure HDFS files with dedicated representations in Cassandra. We plan to roll out Cassandra in other data centers as well.
Cassandra gives us high performance, continuous updates, and querying of large time series datasets with strong transactional guarantees. Cassandra also offers way better querying support than a pure key-value store.
Definitely try to learn by heart how Cassandra keys work. Think about key distribution and what your datasets will look like in a few months. Cassandra scales and performs well if used correctly, but it can’t magically fix your mistakes.
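To make the point about key distribution concrete, here is a rough Python sketch that compares two candidate partition keys on made-up data. Everything in it is an assumption for illustration: the helper names, the `(app, day)` row shape and the use of MD5 as a stand-in hash. Cassandra's default partitioner uses Murmur3 and a token ring, which this does not model.

```python
import hashlib
from collections import Counter

def node_for(partition_key, num_nodes=6):
    """Map a partition key to a node index with a stand-in hash (not Murmur3)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def rows_per_partition(rows, key_fn):
    """Count how many rows land in each partition for a given key choice."""
    return Counter(key_fn(row) for row in rows)

# Hypothetical data: one very busy app plus 100 quiet ones, spread over 30 days.
rows = [("hot", i % 30) for i in range(9000)]
rows += [("app%d" % (i % 100), i % 30) for i in range(1000)]

by_app = rows_per_partition(rows, lambda row: row[0])          # key: app only
by_app_day = rows_per_partition(rows, lambda row: (row[0], row[1]))  # key: (app, day)

# Rows each node would hold if the partition key were (app, day).
node_load = Counter()
for key, size in by_app_day.items():
    node_load[node_for(key)] += size
```

With the app alone as the partition key, the busy app ends up in a single partition of 9,000 rows that keeps growing. Adding the day to the key caps that partition at 300 rows per day, which is the kind of question worth asking before the dataset is a few months old.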
Always start by figuring out every single query your database needs to support. If you come from the world of relational databases, your brain will need to be rewired, because the data is modelled in a fundamentally different way. Trust me, my brain was rewired the day I realized that what I’m actually dealing with is a REALLY fancy hash map.
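The "fancy hash map" mental model can be sketched in a few lines of Python: a hash map from partition key to rows that are kept sorted by clustering key. The `TinyTable` class and its methods are hypothetical and leave out almost everything a real database does (replication, persistence, tombstones); the sketch only shows why queries have to be designed around the keys.

```python
from bisect import bisect_left

class TinyTable:
    """Toy model: partition key -> list of (clustering_key, value), kept sorted."""

    def __init__(self):
        self._partitions = {}

    def upsert(self, partition_key, clustering_key, value):
        rows = self._partitions.setdefault(partition_key, [])
        # (clustering_key,) sorts just before any (clustering_key, value) pair.
        i = bisect_left(rows, (clustering_key,))
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)        # overwrite existing row
        else:
            rows.insert(i, (clustering_key, value))  # keep the list sorted

    def slice(self, partition_key, start, end):
        """Return rows with start <= clustering_key < end in one partition."""
        rows = self._partitions.get(partition_key, [])
        lo = bisect_left(rows, (start,))
        hi = bisect_left(rows, (end,))
        return rows[lo:hi]
```

Looking up one partition and scanning a sorted range inside it is cheap in this model. Anything else, such as filtering on a value across all partitions, means walking the whole map, which is why the queries come first and the table layout second.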
Lastly, read articles and blog posts by people who run Cassandra in production to get real-world insights into tips, tricks and pitfalls. Also consider using compact storage, even on 2.1. For adsquare and me personally, Cassandra boasts a strong and helpful community, which is an invaluable source of information.
We’re a technology-driven company and our asset is infrastructure. In order to improve our technology stack and infrastructure and build the most sophisticated and scalable data marketplace for mobile advertising, we make sure our team is made up of the best talent. We’re looking for experienced Big Data engineers, platform architects and high potentials who can help us build an outstanding platform. So if readers are interested in working for one of Berlin’s expanding, innovative tech startups, they can check our current openings.