Open Source RDF Datastore CumulusRDF used for Real-Time Vehicle Data Monitoring Powered by Apache Cassandra
Andreas Wagner, Lead Developer at CumulusRDF
I work at the Karlsruhe Institute of Technology (KIT) – a large technical university in Germany.
At the Knowledge Management research group we are an interdisciplinary team of computer scientists, mathematicians, and industrial engineers. Our team is one of the leading groups in the Semantic Web research area. Research topics within our group include: Semantic Search, Linked Data, Ontology Engineering, Data and Text Mining, and Service Science.
I’m a researcher in this group (currently finishing my PhD thesis) and lead developer for our RDF data store: CumulusRDF.
CumulusRDF was initiated by two colleagues: Günter Ladwig and Andreas Harth.
Path to Cassandra
We implemented a first version of CumulusRDF based on Google’s AppEngine. We looked at several other NoSQL stores. However, Cassandra offers a highly scalable backend. In particular, its performance for write-intensive applications is quite amazing. Further, there’s an active community around Cassandra, which makes it very easy to find tools, documentation, and support.
Cassandra provides us with an easy-to-use backend and lets us focus on our implementation and features. This way, we were able to develop a novel RDF storage solution without having to worry about scalability or data/load distribution over a given cluster.
We are using Cassandra as a backend for our RDF data store, CumulusRDF. RDF is an accepted W3C standard for publishing structured data on the Web. RDF encodes data with a flexible data schema, which makes it a prime candidate for heterogeneous Web data. However, this flexibility poses great challenges for data management solutions such as CumulusRDF.
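RDF's flexible schema comes down to representing every fact as a subject–predicate–object triple, so two resources need not share any attributes at all. A minimal sketch of the idea in Python; all the URIs and property names below are invented for illustration:

```python
# Minimal illustration of the RDF triple model: each fact is a
# (subject, predicate, object) tuple, so resources need not share a schema.
triples = [
    ("ex:vehicle1", "ex:batteryStatus", "87"),
    ("ex:vehicle1", "rdf:type", "ex:ElectricVehicle"),
    ("ex:article9", "dc:title", "Smart Grids"),  # an entirely different "shape"
]

# Group by subject to see the schema-free structure.
by_subject = {}
for s, p, o in triples:
    by_subject.setdefault(s, {})[p] = o

print(by_subject["ex:vehicle1"]["ex:batteryStatus"])
```

Note that nothing forces `ex:vehicle1` and `ex:article9` to have any properties in common, which is exactly what makes heterogeneous Web data easy to encode and hard to store efficiently.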
We used multiple versions of Cassandra: 1.1.x to 2.0.x.
Cassandra in action
CumulusRDF is currently used in two research projects: PlanetData, a European project on large-scale data management, and iZEUS, a project that fosters the integration of Smart Traffic and Smart Grid solutions.
I mainly work on iZEUS, where we have a CumulusRDF instance that manages real-time data from electric vehicles. We have a fleet comprising 30 electric vehicles, which communicate their status information (e.g., battery status) in real time to CumulusRDF. All this data is managed as RDF and is available as Linked Data as well as via a SPARQL endpoint. The former (Linked Data) allows data to be accessed via simple, lightweight HTTP operations. The latter (SPARQL) gives you a structured query language, much like SQL, to express complex information needs.
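At its core, a SPARQL query is a set of triple patterns in which variables match anything. The sketch below is not SPARQL itself, just an in-memory illustration of that matching idea; the identifiers and data are invented:

```python
# Toy triple-pattern matcher illustrating what a SPARQL basic graph
# pattern does: None acts as a variable, constants must match exactly.
def match(triples, s=None, p=None, o=None):
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

fleet = [
    ("ex:vehicle1", "ex:batteryStatus", "87"),
    ("ex:vehicle2", "ex:batteryStatus", "41"),
    ("ex:vehicle1", "ex:speed", "53"),
]

# In spirit: SELECT ?v ?b WHERE { ?v ex:batteryStatus ?b }
for subj, _, obj in match(fleet, p="ex:batteryStatus"):
    print(subj, obj)
```

A real endpoint joins many such patterns and runs them over billions of triples, which is where the backing store's index layout matters.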
Generally speaking, CumulusRDF offers a highly scalable RDF data store for write-intensive applications.
In our current iZEUS research project, we use a CumulusRDF instance over a cluster of 4 nodes, which are hosted at the university’s computing center. We log real-time data from a fleet of electric vehicles – leading to billions of RDF triples to be managed by CumulusRDF.
In future projects, we plan to deploy CumulusRDF to bigger clusters (featuring more than 100 nodes) and manage even larger data loads.
From my point of view, the best way to get started is to set up a local Cassandra instance and try to (re-)implement Cassandra code examples available on the Web.
Joining a community
I subscribed to the Cassandra mailing list and quickly noticed that the Cassandra community is extremely active. Questions are asked regularly, while being answered quickly and thoroughly. Also other activities, such as Cassandra conferences, are a good indicator for the liveliness of the Cassandra community.
CumulusRDF is an open source project – feel free to check it out.
With many diseases, doctors have the benefit of a blood test that, more or less, definitively proves the presence of the disease. But for other conditions, such as sepsis, a bacterial infection state that kills millions of people each year, there is no single clear-cut test. Thanks to new big data techniques that can continuously monitor and analyze the interplay of more than 100 signs and potential symptoms of sepsis, however, hospitals are detecting the condition earlier, and saving both lives and money.
Sepsis is an inflammatory disease state that occurs when the human body initiates an overwhelming immune response to an initial infection. It takes a surprisingly high toll on people in this country and around the world. Conservative estimates have the disease affecting 1 to 2 percent of all hospital patients in the United States, or about 750,000 Americans per year, and killing up to half of those diagnosed. However, new research suggests that sepsis is actually much more prevalent than initially thought, and that it could be killing 3 million Americans per year, and between 15 million and 20 million globally.
The fuzziness around the numbers is part of the problem. It’s difficult to get a positive diagnosis of sepsis because it mimics so many other conditions. Its major symptoms (fever, chills, rapid breathing and heart rate, rash, confusion, and disorientation) are symptoms of other diseases and conditions. Once sepsis is diagnosed, intravenous antibiotics must be applied immediately to save the patient. These drugs are not cheap, and the cost to fight sepsis in the U.S. is estimated to be between $30 billion and $50 billion.
That a little-known disease state could be wreaking such havoc on the lives and budgets of millions of people was intriguing to a small group of technologists in Southern California. Steve Nathan and Christopher Rosin were working in other areas of healthcare analytics when they learned about sepsis.
“We were partnering with a company that provided automatic collection of the data that streams off the bedside monitors – the heart rate, the respiratory rate, etc. – and we started to get interested in doing something predictive and interesting with that data,” Nathan says. “We stepped back and realized that, for a complex disease state like sepsis, there’s a lot of subjectivity and ambiguity for a clinician in analyzing those clinical variables.”
The problem is, you can’t just look at the vital signs coming off those bedside machines. “You’ve got to look at all of the available data because you might find a signal in an unexpected place or a combination of places,” Nathan tells Datanami.
Amara Health Analytics was created with the initial goal of developing a system that could help clinicians identify sepsis in a patient earlier and more accurately than the traditional methods hospitals are using. The cloud-based system they built, called Clinical Vigilance for Sepsis, has been installed at four hospitals, and is on its way to being a big data success story in healthcare.
How It Works
Amara’s Clinical Vigilance system is composed of online and offline components. Much of the hardcore data science work that Rosin and the Amara team put into the system takes place in a proprietary big data repository filled with millions of historical patient records. Here, Rosin and his team trained machine learning algorithms to find the various signals that correlate with sepsis.
“We’ve built a predictive model that looks at all the available data and finds the earliest strong indication we can for a patient who will eventually be diagnosed with sepsis,” Rosin tells Datanami. “We’ve used machine learning techniques to look for the early signal that a patient’s headed to treatment and diagnosis for sepsis. We’ve really been able to find it. It’s a hybrid of established guidelines from the Surviving Sepsis Campaign, and the predictive model.”
Whereas the Surviving Sepsis Campaign trains doctors and nurses to look at about 20 different data variables, Amara’s Clinical Vigilance system casts a much wider net, and looks at more than 100 different variables, which also include facts derived from raw data. Inputs into the system include real-time telemetry data from bedside machines; structured data, such as medical codes and other numeric values; and unstructured data, such as doctors’ notes, operative reports, and discharge summaries. These are the sources of data that get collected by Amara’s system at runtime.
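"Facts derived from raw data" can be as simple as ratios and threshold flags computed from the incoming vitals. The sketch below uses textbook SIRS-style cutoffs purely for illustration; the feature names and thresholds are invented and are not Amara's proprietary model:

```python
# Hypothetical derived features from raw bedside vitals. The cutoffs are
# illustrative SIRS-style examples, NOT Amara's actual predictive model.
def derive_features(vitals):
    hr = vitals["heart_rate"]     # beats/min
    rr = vitals["resp_rate"]      # breaths/min
    temp = vitals["temp_c"]       # degrees Celsius
    sbp = vitals["systolic_bp"]   # mmHg
    return {
        "shock_index": hr / sbp,             # a derived ratio
        "tachycardia": hr > 90,              # threshold flags
        "tachypnea": rr > 20,
        "abnormal_temp": temp > 38.0 or temp < 36.0,
    }

features = derive_features(
    {"heart_rate": 110, "resp_rate": 24, "temp_c": 38.6, "systolic_bp": 95})
print(features)
```

Each raw stream can spawn several such derived variables, which is how 20-odd bedside readings fan out into well over 100 model inputs.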
The Clinical Vigilance runtime is a pure Java-based application. Data for each patient is continuously collected and stored in memory. Doctor’s notes and other written data sources are run through natural language processing (NLP) routines to extract meaning. These NLP routines are critical to the system because some of the symptoms of sepsis, such as altered mental states, are subjective determinations that can only be deduced by understanding doctors’ notes.
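Real clinical NLP is far more involved (negation handling, context, abbreviations), but the core idea of turning free-text notes into structured flags can be sketched with simple keyword matching. The cue vocabulary below is invented for illustration:

```python
# Toy stand-in for clinical NLP: map free-text phrases in a note to a
# structured "altered mental state" flag. Real systems use full NLP
# pipelines, not keyword lists; these cues are illustrative only.
ALTERED_MENTAL_STATE_CUES = ("confused", "disoriented", "lethargic")

def note_flags(note_text):
    text = note_text.lower()
    return {"altered_mental_state":
            any(cue in text for cue in ALTERED_MENTAL_STATE_CUES)}

print(note_flags("Patient appears confused and febrile this morning."))
```

The point is that a subjective observation buried in prose becomes one more boolean input sitting alongside the numeric telemetry.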
When Clinical Vigilance detects that a patient is headed toward sepsis, the system sends a text alert to the doctor or nurse. That is the only clinical interface to the system. The company uses the DataStax Enterprise NoSQL database to store all the clinical data for the purpose of running queries and generating reports for hospitals. The company was leaning toward a NoSQL system because of the need to have flexible schemas, and finally selected DataStax’s Cassandra distribution primarily because of its strong Lucene-based search capabilities.
Saving Lives and Money
Amara’s capability to find signals across potentially millions of data points for a single patient is the source of its big data-driven sepsis breakthrough.
“It does not come down to one signal. It’s multiple factors and it tends to be different from one patient to another,” Rosin says. “Sepsis is a disease state for which researchers have broadly been looking for one signal and diagnostic companies have been looking for the one biomarker, and no one’s really found it. And we haven’t either. There are different criteria that apply to different patients. It ends up being pretty complex, which is okay for us, because we have millions of patient records available to us for data mining. That lets us derive a model that has significant complexity, while still allowing us to verify it.”
Hospitals that adopt Clinical Vigilance are able to detect sepsis earlier, which translates into quicker administration of antibiotics and a shorter stay in the hospital. For a 300-bed hospital, the average is about $2 million in direct savings, as measured by length of stay.
Amara’s system also has a low rate of false positives. “Many companies have attempted to do a sepsis alerting system, but almost all of them have suffered from high false positives,” Nathan says. “What we do both with natural language processing and machine learning algorithms is look at the entire context of the patient before the alert is sent out. Our specificity–the accuracy of the system across all adult patients–is about 99 percent, much higher than competitors.”
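Specificity measures how well the system avoids false alarms on patients who do not have the condition. As a quick reminder of the arithmetic (the patient counts below are made up, not Amara's figures):

```python
# Specificity = true negatives / (true negatives + false positives),
# i.e., the fraction of non-septic patients who do NOT trigger an alert.
def specificity(true_neg, false_pos):
    return true_neg / (true_neg + false_pos)

# Made-up example: 9,900 non-septic patients correctly quiet,
# 100 false alarms -> 0.99, i.e., 99 percent specificity.
print(specificity(9900, 100))
```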
Amara is targeting sepsis today, but sees the potential to broaden its reach to provide early detection of other diseases in the future. In addition to specific diseases, such as acute kidney injury and cardiac conditions, Amara is considering a big data approach to identifying general patient deterioration with any underlying root cause.
The potential for hospitals to make use of big data is just now being realized. “The data that hospitals are now collecting on a massive scale, this EHR [electronic health record] data–it’s a goldmine for applications like this,” Nathan says. “We’ve hit the tipping point now for medical records. It doesn’t mean that all hospitals are fully digitized yet. But it’s really underway.”
Alain Rodriguez, Lead Architect and Data Scientist at Teads
Teads is an innovative ad server. Our goals are to create links between advertisers and publishers and to provide the technology for broadcasting video ads in an “outstream” way, so no surrounding video content is needed to play a video ad.
I am the main data scientist and architect, in charge of storing our tracking data, which is used to provide real-time statistics and to feed the data our algorithm uses to choose the best ad to broadcast, millions of times every day.
Cassandra at Teads
We use Cassandra in 3 distinct ways:
- We use a lot of counters to provide real-time statistics on the number of people exposed to any ad, any website, and more.
- We store raw data to be able to provide (someday) more detailed statistics, crossing more dimensions, in batch, using Hadoop.
- We store the data our algorithm needs to choose the best ad to display according to set rules.
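The counters mentioned above are a dedicated Cassandra column type that can be incremented concurrently without a read-before-write. A hedged sketch of how per-ad impression counters might look, with invented table and column names, simulated in plain Python:

```python
# In CQL, a counter table might look like (names are hypothetical):
#   CREATE TABLE ad_stats (ad_id text PRIMARY KEY, impressions counter);
#   UPDATE ad_stats SET impressions = impressions + 1 WHERE ad_id = 'ad42';
# Below, an in-memory simulation of the same idea with collections.Counter.
from collections import Counter

impressions = Counter()

def record_impression(ad_id):
    impressions[ad_id] += 1  # counter columns only support increment/decrement

for _ in range(3):
    record_impression("ad42")
record_impression("ad7")

print(impressions["ad42"])
```

The appeal for a tracking workload is that each increment is a cheap write; the running total is never read back on the hot path.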
We started using Cassandra 0.8.0 about 2 years ago. We upgraded to each major release and are now using Cassandra 1.2.11.
We liked Cassandra’s main characteristics:
- No single point of failure (we have SLAs, and any downtime is really expensive)
- Horizontal scaling (Using AWS, this is very easy and efficient)
- Write efficiency (We track a lot, so our use case fits well.)
- Presence of counters
- Peer-to-peer clustering, with no master/slaves.
We had no time to benchmark to help us choose the right technology, so we decided after reading a lot on the web: we chose Cassandra over HBase, mainly because our use case implies a lot of writes.
We now have one DC in AWS eu-west, with 28 m1.xlarge nodes, using Cassandra 1.2.11 and holding 300 GB of data each.
We also use a replication factor of 3 and perform both reads and writes at consistency level QUORUM.
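With a replication factor of 3, QUORUM means 2 replicas must acknowledge each read and each write; since the read and write sets then overlap, every read sees the latest acknowledged write. The arithmetic:

```python
# QUORUM = floor(replication_factor / 2) + 1. Reads and writes overlap
# (strong consistency) whenever read_replicas + write_replicas > RF.
def quorum(rf):
    return rf // 2 + 1

rf = 3
r = w = quorum(rf)
print(r, w, r + w > rf)
```

This is also why a 3-replica QUORUM setup tolerates one node being down: 2 of 3 replicas are still enough to satisfy both reads and writes.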
We already tried opening a new DC and will go live with the second DC in a few weeks and a third should follow.
Advice on getting started
For the operational part, which is a very important part of using Cassandra, I think it is mandatory to understand a bit of Cassandra’s internals. You need to understand how things work under the hood to be efficient. Cassandra needs a good configuration, and this configuration highly depends on your use case. You can’t just do things as other people do; it won’t necessarily work well for you.
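For example, a handful of `cassandra.yaml` settings whose right values depend entirely on your hardware and workload; the numbers below are illustrative defaults from the 1.2 era, not recommendations:

```yaml
# Illustrative cassandra.yaml excerpt -- tune for YOUR hardware and workload.
num_tokens: 256                        # virtual nodes per machine
concurrent_reads: 32                   # usually scaled with disk count
concurrent_writes: 32                  # usually scaled with CPU core count
compaction_throughput_mb_per_sec: 16   # throttle background compaction I/O
```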
So take the time to understand how this beautiful tool works, or you will regret it later.
Apache Cassandra community
The Cassandra community might be one of my favorite things about Cassandra. The community is active, all the time, and ready to help through multiple channels (IRC, mailing lists, GitHub, …).
Numbers can sometimes be more explicit than words: according to my Grokbase Cassandra user profile, I have sent 274 mails to ask or answer questions, putting me in the top 10 users of the mailing list. I almost always got answers to my questions, and I have helped a lot of people myself.
Well, as you may have gathered, the community is at the center of my Cassandra usage, and I think it should be that way for any user.
A Spoonful of Apache Cassandra Keeps the Database from Going Down; Recipe Site Whisk Migrates from MongoDB
Viktor Taranenko, Senior Engineer at Whisk
Whisk is about creating grocery lists from recipes, allowing people to buy ingredients in a very straightforward way. By applying various techniques, we are trying to “understand” the recipes and ingredients required to cook them. We work with companies like FoodNetwork and a number of recipe publishers and product brands. Here I’m a senior engineer, looking after Whisk’s back-end services and infrastructure.
MongoDB to Cassandra
Recently, with a growing number of recipes, we were struggling to keep our internal processing fast. So we migrated and refined some parts that relied on MongoDB before. With great concurrency and fast writes, Cassandra helped us a lot. MongoDB, I’d say, only performed well while the data fit in RAM, but satisfying that requirement was quite expensive for us. With Cassandra, we no longer feel limited by the amount of storage.
Moving things from MongoDB to Cassandra is still in progress. We’ve already fixed the most critical stuff and improved our finances by moving to Cassandra.
The horizontal scalability of Cassandra is just great. Cassandra’s columnar nature allows us to design the schema for very efficient queries and updates. It’s so easy to get a Cassandra cluster deployed, configured, and running. Thanks to DataStax for providing good Cassandra documentation, which helped a lot.
Currently it’s a rather small cluster with six machines. Our first Cassandra use case was storing precomputed product data without any limits. And no matter how much we store, response times are predictable now. One of our recent use cases is our ingredient graph. We decided to build it with Titan running on top of Cassandra and ElasticSearch. The main reason was the easy replication and horizontal scalability derived from Cassandra.
Whisk’s ingredient graph
Currently we expect our graph database to give us the ability to maintain core data easily and reuse it on other nodes. Wine recommendation is a very good example, where we can have wine data attached to base ingredients and share it between subtypes. For example, when you have “Orange” with a lot of metadata attached to it, like wine pairings and nutritional information, we can easily reuse it on related ingredients like “Orange Juice”. We found the graph database very good for representing our ingredient relations and making that information quick to reach.
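The reuse idea, attaching metadata once to a base ingredient and letting subtypes reach it via graph edges, can be sketched without Titan; everything below is an invented in-memory stand-in, not Whisk's schema:

```python
# Toy ingredient graph: metadata lives on the base node ("Orange") and
# subtypes ("Orange Juice") reach it by following is_subtype_of edges.
metadata = {"Orange": {"wine_pairing": "Riesling", "vitamin_c_mg": 53}}
parent = {"Orange Juice": "Orange"}  # is_subtype_of edges

def lookup(ingredient, key):
    node = ingredient
    while node is not None:
        if key in metadata.get(node, {}):
            return metadata[node][key]
        node = parent.get(node)      # walk up toward the base ingredient
    return None

print(lookup("Orange Juice", "wine_pairing"))
```

A graph database does the same traversal at scale, with the edges stored alongside the rest of the data.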
Thank you to the team at DataStax for a great product. DataStax has made quite a big effort to make the documentation great and to organize conferences and webinars. Recently I attended the Cassandra Summit in Europe, which took place in London; that was almost my starting point with Cassandra. It was a great time to share experiences, see a lot of use cases in the presentations, and eventually learn some useful details from the second-day workshop.
About This Presentation
Starting with version 1.2, Cassandra has made it easier to store more data on a single node. With off-heap data structures, virtual nodes, and improved JBOD support, we can now run nodes with several terabytes of data.
In this talk Aaron Morton, Co-Founder and Principal Consultant at The Last Pickle, will walk through running fat nodes in a Cassandra cluster. He’ll review the features that support it and discuss the trade-offs that come from storing 1 TB+ per node.
About Aaron Morton
Aaron Morton is Co-Founder & Principal Consultant at The Last Pickle based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.