Andreas Wagner, Lead Developer at CumulusRDF
I work at the Karlsruhe Institute of Technology (KIT) – a large technical university in Germany.
In the Knowledge Management research group, we are an interdisciplinary team of computer scientists, mathematicians, and industrial engineers. Our team is one of the leading groups in the Semantic Web research area. Research topics within our group include Semantic Search, Linked Data, Ontology Engineering, Data and Text Mining, and Service Science.
I’m a researcher in this group (currently finishing my PhD thesis) and lead developer for our RDF data store: CumulusRDF.
CumulusRDF was initiated by two colleagues: Günter Ladwig and Andreas Harth.
Path to Cassandra
We implemented a first version of CumulusRDF based on Google’s AppEngine. We also looked at several other NoSQL stores, but Cassandra stood out as a highly scalable backend. In particular, its performance for write-intensive applications is quite amazing. Further, there’s an active community for Cassandra, which makes it very easy to find tools, documentation, and support.
Cassandra provides us with an easy-to-use backend and lets us focus on our implementation and features. This way, we were able to develop a novel RDF storage solution without having to worry about scalability or data and load distribution over a given cluster.
We are using Cassandra as a backend for our RDF data store, CumulusRDF. RDF is an accepted W3C standard for publishing structured data on the Web. RDF encodes data with a flexible schema, which makes it a prime candidate for heterogeneous Web data. However, this flexibility poses great challenges for data management solutions such as CumulusRDF.
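To make the flexible schema concrete, here is a minimal sketch of how two unrelated resources can be written as RDF triples in the N-Triples syntax. The namespace, the property names, and the values are invented for illustration and are not taken from CumulusRDF or iZEUS.

```python
# Hypothetical namespace and property names, chosen only for illustration.
EX = "http://example.org/fleet/"


def to_ntriple(subject, predicate, obj):
    """Serialize one (subject, predicate, object) statement as an N-Triples line.

    Subject and predicate are IRIs; the object is written as a plain string literal.
    """
    escaped = str(obj).replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


# Two resources described by different sets of properties.
# Nothing forces them to share a common schema.
records = {
    EX + "vehicle/1": {EX + "batteryLevel": 87, EX + "speedKmh": 42},
    EX + "station/7": {EX + "freeSlots": 3},
}

triples = [
    to_ntriple(subject, predicate, value)
    for subject, properties in records.items()
    for predicate, value in properties.items()
]

for line in triples:
    print(line)
```

Each statement stands on its own, so a new property can be added to one resource without touching the others. That is convenient for publishers, and it is also why a store cannot rely on fixed columns the way a relational table does.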
We used multiple versions of Cassandra: 1.1.x to 2.0.x.
Cassandra in action
CumulusRDF is currently used in two research projects: PlanetData, a European project on large-scale data management, and iZEUS, a project that fosters the integration of Smart Traffic and Smart Grid solutions.
I mainly work on iZEUS, where we have a CumulusRDF instance that manages real-time data from electric vehicles. We have a fleet of 30 electric vehicles, which communicate their status information (e.g., battery status) to CumulusRDF in real time. All this data is managed as RDF and is available as Linked Data as well as via a SPARQL endpoint. The former (Linked Data) gives access to the data via simple and lightweight HTTP operations. The latter (SPARQL) gives you a structured query language, much like SQL, to express complex information needs.
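As a rough sketch of the difference between the two access paths, the snippet below builds (but does not send) one HTTP request of each kind. The host, the paths, the vehicle IRI, and the property name are all hypothetical, and the media types are assumptions about what a server might offer rather than a description of CumulusRDF's actual interface. The SPARQL request follows the W3C SPARQL Protocol convention of passing the query in a `query` parameter.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; a real deployment would have its own host and paths.
BASE = "http://example.org/cumulusrdf"


def linked_data_request(resource_iri):
    """Linked Data lookup: a plain GET on the resource's own IRI."""
    return Request(resource_iri, headers={"Accept": "text/turtle"})


def sparql_request(endpoint, query):
    """SPARQL query via GET, with the query text in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical property IRI; asks for every vehicle and its battery level.
QUERY = """
SELECT ?vehicle ?level WHERE {
  ?vehicle <http://example.org/fleet/batteryLevel> ?level .
}
""".strip()

lookup = linked_data_request(BASE + "/resource/vehicle/1")
search = sparql_request(BASE + "/sparql", QUERY)

print(lookup.get_method(), lookup.full_url)
print(search.get_method(), search.full_url)
```

The lookup returns whatever is known about one resource, while the query can join and filter across the whole data set, which is where the comparison with SQL comes from.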
Generally speaking, CumulusRDF offers a highly scalable RDF data store for write-intensive applications.
In our current iZEUS research project, we use a CumulusRDF instance over a cluster of 4 nodes, which are hosted at the university’s computing center. We log real-time data from a fleet of electric vehicles – leading to billions of RDF triples to be managed by CumulusRDF.
In future projects, we plan to deploy CumulusRDF to bigger clusters (featuring more than 100 nodes) and manage even larger data loads.
From my point of view, the best way to get started is to set up a local Cassandra instance and try to (re-)implement Cassandra code examples available on the Web.
Joining a community
I subscribed to the Cassandra mailing list and quickly noticed that the Cassandra community is extremely active. Questions are asked regularly and are answered quickly and thoroughly. Other activities, such as Cassandra conferences, are also a good indicator of how lively the community is.
CumulusRDF is an open source project – feel free to check it out.