January 20th, 2015

Robert Stupp, Apache Cassandra Committer.
Robert started his professional IT career in 1995 and specializes in high-volume backend data processing, with strong knowledge of coding and networks. He is a contributor to the Apache Cassandra open source community and is active on Cassandra’s 3.0 feature, “user defined functions.” He has also given Cassandra presentations and workshops as part of his previous projects.

Back in 2014 I took a stab at implementing the Apache Cassandra 3.0 feature User-Defined-Functions, and it has developed nicely over the last few months.

We have added a lot more functionality, made things clearer and so on. As I mentioned in the presentation: this is stuff that changes, even shortly before release. When Cassandra 3.0 is released, you can expect another blog article about UDFs that covers the whole feature.

UDF – Current Status

So far, UDFs support executing small pieces of user code in Java (or a scripting language like JavaScript) and aggregating data using your own aggregation functions.
But UDFs are not meant for something like map-reduce or expensive analysis. Keep in mind that UDFs are executed on the coordinator node (the node in your C* cluster that received your query).
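
To make the aggregation idea a bit more concrete, here is a minimal sketch in plain Java. It assumes an aggregate is built from a state function that is called once per row and a final function that turns the accumulated state into the result. The class and method names (AverageAggregateSketch, accumulate, finish) are made up for illustration and are not Cassandra's API; in Cassandra, only the function bodies would be supplied, as the body of a function declared with LANGUAGE java.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a user-defined aggregate. All names here are
// hypothetical; this is not Cassandra code.
final class AverageAggregateSketch {

    // Aggregate state: running count and running sum.
    static final class State {
        final long count;
        final long sum;

        State(long count, long sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    // "State function": folds one column value into the state.
    // Null values are skipped, which is an assumption of this sketch.
    static State accumulate(State state, Integer value) {
        if (value == null) {
            return state;
        }
        return new State(state.count + 1, state.sum + value);
    }

    // "Final function": converts the accumulated state into the result.
    static double finish(State state) {
        return state.count == 0 ? 0.0 : (double) state.sum / state.count;
    }

    // Simulates what happens for one query: start from an initial state,
    // apply the state function per row, then apply the final function.
    static double average(List<Integer> column) {
        State state = new State(0, 0);
        for (Integer value : column) {
            state = accumulate(state, value);
        }
        return finish(state);
    }

    public static void main(String[] args) {
        System.out.println(average(Arrays.asList(1, 2, 3, 4))); // prints 2.5
    }
}
```

Note that the loop in average() runs in one place, which matches the point above: the work happens on a single node, so this is suited to small computations rather than cluster-wide analysis.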

There are thoughts about moving UDF processing to the nodes owning the partitions if a win in terms of e.g. execution time is certain. But please do not expect that for 3.0. It’s just an idea right now.

UDF – Red flag

I want to put a big red flag (“don’t do it”) on using UDFs with scripting languages. Although it is a nice feature, it is also an expensive one. Why?

Java source UDFs are compiled directly to byte code, so they can be invoked immediately without any indirection: no reflection and no invoke-dynamic. They have nearly zero invocation latency and can be optimized by HotSpot.

For scripted UDFs this is not true! They are invoked via the scripting language’s SPI implementation, which may have to convert types and look up functions, possibly just interprets the function, and has to go back into Java if you work with collections, tuples or UDTs. For example: invoking a JavaScript UDF takes approx. 1000 times longer than invoking a Java UDF.
So, IMO, scripted UDFs are nice for quick prototyping but should be replaced with Java UDFs in production code.
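
The difference between the two invocation paths can be sketched in plain Java. This example assumes the scripting SPI in question is javax.script (JSR 223); the class and method names (InvocationPathSketch, addDirect, addScripted) are made up for illustration, and this is not how Cassandra itself wires things up. Recent JDKs no longer bundle a JavaScript engine, so the scripted path returns null when none is installed.

```java
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Illustrative comparison of a direct Java call and a call routed through
// the javax.script SPI. All names are hypothetical; not Cassandra code.
final class InvocationPathSketch {

    // Direct path: an ordinary static method call that the JIT can inline.
    static double addDirect(double a, double b) {
        return a + b;
    }

    // Scripted path: engine lookup, script evaluation, function lookup by
    // name, argument boxing and result conversion all sit between the
    // caller and the actual addition.
    // Returns null if no usable JavaScript engine is available.
    static Double addScripted(double a, double b) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (!(engine instanceof Invocable)) {
            return null; // no engine installed, or it cannot invoke functions
        }
        engine.eval("function add(a, b) { return a + b; }");
        Object result = ((Invocable) engine).invokeFunction("add", a, b);
        return ((Number) result).doubleValue();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("direct:   " + addDirect(1.0, 2.0));
        Double scripted = addScripted(1.0, 2.0);
        System.out.println("scripted: " + (scripted == null ? "no JavaScript engine found" : scripted));
    }
}
```

Even in this tiny case the scripted call needs a name lookup, boxed arguments and a conversion of the result back to a Java type, which are the kinds of per-call costs described above.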

Cassandra 3.0

It is fairly certain that these features will find their way into the Apache Cassandra 3.0 release:

  • Core User-Defined-Functions
  • User-Defined-Aggregates
  • Functional indexes
  • Permissions for functions
  • Support for this in cqlsh


Thanks to all those people who gave me a lot of useful tips. I’d like to sum them up:

  • Prepare your presentation – take your time. I prepared this presentation months before the Summit. For this one, I also asked others to do a review.
  • Don’t put too much onto the slides – focus on the “big thing”
  • Train your skills on local meet-ups
  • If in doubt: less is more. Don’t go too much into details. You can always explain specifics that people are interested in later, during Q&A or in a chat. A longer Q&A part is better than overrunning the (fixed) time slot.
  • Use a real presentation remote 🙂
  • One of the best tips IMO: Start your presentation with a loud welcome.

During other presentations and trainings I’ve used my iPhone as a remote for Keynote, and it worked nicely. But at the Summit, the iPhone kept changing its orientation, so when I wanted to go forward, it went backwards. Not funny…

Lesson learned: use a “real” presentation remote – you just need one big button to advance slides. 🙂

Robert Stupp
Committer to Apache Cassandra

“User-Defined-Functions presentation at Cassandra EU Summit 2014” was created by Robert Stupp, Committer to Apache Cassandra.