Using Cassandra and NVIDIA DALI

Patrick McFadin on April 4, 2023

Training machine learning models depends on loading and preprocessing data quickly. The NVIDIA Data Loading Library (DALI) is a tool for loading and preprocessing images for PyTorch or TensorFlow. To use DALI, however, the data must be stored in a format or database that DALI can read. The Cassandra plugin for NVIDIA DALI fills that gap by loading data directly from an Apache Cassandra NoSQL database.

Apache Cassandra is a popular choice for storing large datasets, making it ideal for machine learning models requiring massive amounts of data. With the Cassandra plugin for NVIDIA DALI from developer and researcher Francesco Versaci, users can load data from Cassandra databases efficiently and process it for training models in PyTorch or TensorFlow. The plugin also provides a straightforward way to install dependencies, making it easy for users to get started with the tool.

About NVIDIA DALI

The NVIDIA Data Loading Library (DALI) is designed to accelerate deep learning applications by providing optimized building blocks for loading and processing image, video, and audio data. It can replace the built-in data loaders and iterators in popular deep learning frameworks. DALI simplifies the complex, multi-stage data processing pipelines that deep learning applications require. These pipelines typically run on the CPU, where they have become a bottleneck for training and inference.

DALI tackles the CPU bottleneck by offloading data preprocessing to the GPU, and comes with its own execution engine that maximizes the throughput of the input pipeline. It transparently handles features such as prefetching, parallel execution, and batch processing. Additionally, data processing pipelines implemented using DALI are portable, as they can be easily retargeted to work with TensorFlow, PyTorch, MXNet, and PaddlePaddle. This eliminates challenges associated with the portability of training and inference workflows, as well as the maintenance of code that uses multiple data pre-processing implementations.
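To make the prefetching and batching ideas above concrete, here is a minimal sketch in plain Python. It is not DALI code and does not reflect how DALI's execution engine is implemented; the helper names `batched`, `prefetch`, and `preprocess` are hypothetical and exist only for this illustration. A background thread prepares batches ahead of time so that the consumer (for example, a training loop) rarely has to wait.

```python
import queue
import threading
from typing import Iterable, Iterator, List

_SENTINEL = object()  # marks the end of the stream


def batched(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group items into lists of at most batch_size elements."""
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def preprocess(batch: List[int]) -> List[int]:
    """Stand-in for real preprocessing (decoding, resizing, ...)."""
    return [x * 2 for x in batch]


def prefetch(batches: Iterable[List[int]], depth: int = 2) -> Iterator[List[int]]:
    """Yield batches while a background thread prepares up to `depth` ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for b in batches:
                buffer.put(b)  # blocks when the buffer is full
        finally:
            buffer.put(_SENTINEL)  # always signal completion

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


# The preprocessing runs lazily inside the producer thread.
pipeline = prefetch(preprocess(b) for b in batched(range(10), batch_size=4))
result = list(pipeline)
print(result)
```

DALI applies the same overlap of loading and consumption, but does so inside its own engine and can run the preprocessing steps on the GPU.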

Source: https://github.com/NVIDIA/DALI

Loading Data into DALI

The easiest way to test the cassandra-dali-plugin is with the provided Dockerfile, which bundles NVIDIA DALI, the Cassandra C++ and Python drivers, a Cassandra server, PyTorch, and Apache Spark. The container can also run with an external data directory for better performance and persistence, which is handy when working with large datasets that need a fast disk. Once the GitHub repository is cloned locally, you can build and run everything in a Docker container:

$ docker build -t cassandra-dali-plugin .
$ docker run --rm -it --cap-add=sys_nice cassandra-dali-plugin
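The external data directory mentioned above is typically wired in with a Docker bind mount (the `-v host:container` option). The sketch below only assembles such a command; it does not run Docker. The function name `docker_run_command` and both example paths are hypothetical, and the mount point inside the container in particular is an assumption, so check the repository's README for the path the image actually expects.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def docker_run_command(
    host_data_dir: str,
    container_data_dir: str,
    image: str = "cassandra-dali-plugin",
) -> List[str]:
    """Build a `docker run` argument list with a bind-mounted data directory."""
    # Docker bind mounts need absolute paths on both sides.
    for path in (host_data_dir, container_data_dir):
        if not PurePosixPath(path).is_absolute():
            raise ValueError(f"expected an absolute path, got {path!r}")
    return [
        "docker", "run", "--rm", "-it", "--cap-add=sys_nice",
        "-v", f"{host_data_dir}:{container_data_dir}",
        image,
    ]


# Both paths are placeholders, not values taken from the plugin's docs.
cmd = docker_run_command("/mnt/fast-disk/cassandra", "/cassandra/data")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once the paths have been replaced with real ones.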

The plugin supports classification and segmentation tasks. For classification tasks, users can refer to the annotated example provided for Imagenette. The example provides a detailed explanation of how to use and optimize the plugin for classification tasks. For segmentation tasks, users can refer to the less annotated example provided for ADE20k.

To install the plugin on a bare machine instead, users need NVIDIA DALI, the Cassandra C/C++ driver, and the Cassandra Python driver. The installation commands for these dependencies are in the Dockerfile and can serve as a reference for installing whatever is missing. Once the dependencies are in place, the plugin itself can be installed with pip.
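Before running pip on a bare machine, it can help to check which Python-level dependencies are already present. The following is a small sketch, assuming the usual import names `nvidia.dali` for DALI and `cassandra` for the Cassandra Python driver; the helper names are hypothetical. It cannot detect the Cassandra C/C++ driver, which is a native library rather than a Python module.

```python
import importlib.util
from typing import Dict, Iterable


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `nvidia`) is missing,
        # or when a module has no usable spec.
        return False


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each module name to whether it is importable."""
    return {name: is_importable(name) for name in names}


# Import names assumed for DALI and the Cassandra Python driver.
status = check_modules(["nvidia.dali", "cassandra"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```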

In summary, the Cassandra plugin for NVIDIA DALI provides an efficient and easy way to load and preprocess data from Apache Cassandra NoSQL databases for machine learning models. Its integration with DALI and its simple installation process make it a helpful tool for researchers and data scientists working on large datasets, and its support for both classification and segmentation tasks adds to its versatility. By simplifying the data loading and preprocessing step, the plugin lets users focus on the more critical aspects of model building and experimentation.

Get started here: https://github.com/fversaci/cassandra-dali-plugin

NVIDIA DALI Docs page: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html

© 2009-2023 The Apache Software Foundation under the terms of the Apache License 2.0. Apache, the Apache feather logo, Apache Cassandra, Cassandra, and the Cassandra logo, are either registered trademarks or trademarks of The Apache Software Foundation. Sponsored by Anant Corporation and Datastax, and Developed by Anant Corporation.
