Andrea Gazzarini, Software Engineer at @Cult
I’m the Technical Lead at @Cult, a small company working in the library and bibliographic domain. As a consequence, our main products are Library Management Systems and Online Public Access Catalogs, one example being the Trentino bibliographic catalog.
The company is located in Rome and has many customers; the most important are the Pontifical library network in Rome and the Trentino library network (about 700 interconnected libraries).
As Technical Lead, I am responsible for the frameworks, tools, and middleware we use in our projects. In addition, I have never abandoned my real passion, coding, so I remain an active member of the development team.
NoSQL solution for bibliographic data
Basically, we needed something scalable, reliable, fast, and easy to adopt (this is our first experience in the NoSQL world).
Cassandra was the ideal compromise because, in our opinion, it hides a lot of complicated details very well: setup is very easy, and the marginal cost of installing a new node is very low. That is extremely good from a development and productivity perspective, because it allows us to easily have a ready-to-use working environment.
Specifically, I was very attracted by its eventual-consistency model with tunable consistency levels, which allows us to configure and tune the store appropriately depending on the concrete deployment context. Last but not least, Cassandra’s scalability, reliability, and performance are very good, so in the end it was a fairly obvious choice.
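The tunable-consistency idea mentioned above can be made concrete with Cassandra’s quorum rule: with a replication factor N, a write acknowledged by W replicas and a read contacting R replicas are guaranteed to overlap when R + W > N. The sketch below only illustrates that arithmetic; the function name and the example values are ours, not from this project’s configuration.

```python
# Sketch of Cassandra's tunable-consistency rule: a read is guaranteed
# to see the latest acknowledged write when the read and write replica
# sets must overlap, i.e. when R + W > N.
# Values below are illustrative, not this project's settings.

def is_strongly_consistent(n_replicas: int, write_acks: int, read_replicas: int) -> bool:
    """True when every read is guaranteed to overlap the latest write."""
    return read_replicas + write_acks > n_replicas

# Replication factor 3 with QUORUM writes (2) and QUORUM reads (2):
# the read set must overlap the write set, so reads are strong.
assert is_strongly_consistent(3, 2, 2)

# ONE write + ONE read against replication factor 3: no overlap is
# guaranteed, so the system is only eventually consistent.
assert not is_strongly_consistent(3, 1, 1)
```

Tuning here means choosing W and R per operation (e.g. `ONE`, `QUORUM`, `ALL`) to trade latency against consistency for each deployment.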
I think Cassandra perfectly meets all our requirements.
Through Cassandra to RDF
The project where we are using Cassandra is a conversion pipeline that takes bibliographic data as input and outputs RDF triples.
The total number of records to process is quite large. In addition, while the input data (the bibliographic format) is binary and concise, the RDF output is, by its nature, much larger. That was one of the most important factors in our choice of storage.
Output data is stored in Cassandra by means of CumulusRDF, an RDF store that also provides a SPARQL endpoint, so once the data is loaded, users can begin querying it.
Currently we are using Cassandra 1.2.x as underlying storage for CumulusRDF.
As mentioned before, the output consists of RDF triples, so we needed something that understands this format. At the same time, given the volume of data, we needed fast writes, scalability, and reliability. CumulusRDF was an ideal choice because it understands RDF and uses Cassandra as its underlying storage.
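To give a feel for the conversion step, here is a toy sketch that turns a bibliographic record into N-Triples lines, the line-based RDF serialization. The namespace, field names, and field-to-predicate mapping are invented for illustration; the real pipeline and the CumulusRDF write path are not shown in this interview.

```python
# Toy sketch: map a bibliographic record (a dict) to RDF triples in
# N-Triples syntax. The base URI and the field mapping are hypothetical;
# a real pipeline would use a proper RDF library and a richer mapping.

BASE = "http://example.org/bib/"  # hypothetical namespace
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core, a common choice

def record_to_ntriples(record: dict) -> list[str]:
    """Return one N-Triples line per mapped field of the record."""
    subject = f"<{BASE}record/{record['id']}>"
    mapping = {"title": DC + "title", "author": DC + "creator"}
    return [
        f'{subject} <{predicate}> "{record[field]}" .'
        for field, predicate in mapping.items()
        if field in record
    ]

record = {"id": "42", "title": "I promessi sposi", "author": "Alessandro Manzoni"}
for line in record_to_ntriples(record):
    print(line)
```

Each record fans out into several such triples, which is why the RDF output is so much larger than the concise binary input.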
We estimate that a single bibliographic record can produce, with our current configuration, up to 50–60 RDF triples. The initial data set is about 5 million records, a number that is expected to grow as new customers join the project with their data.
Another important factor in our decision is the community, which is very active. That matters in terms of support when you have doubts, and it helps you avoid reinventing the wheel each time.
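A quick back-of-the-envelope calculation from those figures shows the scale involved; the record count and per-record triple range come straight from the text above.

```python
# Volume estimate from the figures in the text: ~5 million records,
# each producing up to 50-60 RDF triples.
records = 5_000_000
triples_low = records * 50
triples_high = records * 60
print(f"{triples_low:,} to {triples_high:,} triples")
# -> 250,000,000 to 300,000,000 triples
```

At a quarter of a billion triples and growing, write throughput and horizontal scaling dominate the storage requirements, which is consistent with the choice of Cassandra.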