Building an Infinitely Smart AI: Powered by the World’s Largest Scalable Database: Apache Cassandra (Part 1)
Most people have by this point heard about “GPT”, more popularly known to many as a catchphrase from South Park for getting difficult things done easily: “Just use ChatGPT, dude.” Though ChatGPT and other LLMs like Anthropic’s Claude, Cohere, and Google’s Bard are pretty cool, they lack accuracy because their data has been “pretrained,” hence the acronym “Generative Pretrained Transformer” (GPT). In recent months there have been some great open source projects like LangChain, LlamaIndex, AutoGPT, and even Microsoft’s Semantic Kernel which are making LLM models a great tool in software development for building “intelligent” apps with “LLM Inside,” without needing to know machine learning or data science!
This series of posts will show you how to build an infinitely smart AI powered by Apache Cassandra without ever touching any machine learning code. How’s that possible? Read on, dude.
Traditional Machine Learning in Real Time
“Traditional machine learning,” a term that only exists because of the advent of LLM-based machine learning, is about bringing data into (most likely) some sort of neural network to create a model. This approach is still relevant today because there are cases in which a “traditionally” trained ML model will outperform a large language model, because it is purpose built.
Neural networks have become the main way people train ML models, using frameworks like TensorFlow and PyTorch to crunch billions of operations through the matrices that mathematically represent those networks. NNs can be used to detect faces, determine whether a picture is a hot dog or not, or identify the sentiment of some text.
In this world of “traditional” machine learning, Apache Cassandra can provide a very strong foundational database that can be used in all aspects of a platform’s machine learning training, evaluation, and ongoing training.
- Data Engineering / DataOps – Cassandra can be used to store an infinitely large amount of real-time events and to keep the cleaned training outcomes from the data wrangling process.
- Data Science / ML Engineering / MLOps – Cassandra can serve as a scalable feature store used in both batch and real-time ML model evaluation, and can store the outcomes of an evaluation / prediction for APIs and applications to consume in real time, or for a data lake / warehouse to pick up for business intelligence.
- Application / Software Engineering – APIs and applications can interact with Cassandra in a transactional timeframe while all the training and evaluation happens in the background, without any loss of speed.
The reason Cassandra can do this without adding latency is the way it has been built from the ground up. It has a peer-to-peer architecture at both the node level and the datacenter level. Data can be kept fully in sync across datacenters while machine learning, transaction processing, and reporting run as separate workloads.
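As a concrete illustration of that workload separation, here is a minimal sketch using the open source cassandra-driver package for Python. The datacenter names (dc_transactional, dc_analytics) and the keyspace are hypothetical placeholders; the point is that NetworkTopologyStrategy replicates the same data to each datacenter, so each workload can run against its own replicas.

```python
from cassandra.cluster import Cluster

# Connect to any node in the cluster; peers gossip the full topology.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Replicate one keyspace to two datacenters: one serving the
# transactional app, one dedicated to ML / reporting workloads.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS smart_app
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_transactional': 3,
        'dc_analytics': 3
    }
""")
```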
There is a lot more to cover on how and why Cassandra is the best database for real-time AI in the context of “traditional machine learning.” Read more in the “Architect’s Guide to Using NoSQL for Real-Time AI.”
Assuming you can get on the bus with all this, let’s talk about how generative AI models can be used and how they complement traditional machine learning.
Generative AI & LLM as a Router
Generative AI, including large language models (LLMs), text-to-image, text-to-speech technologies, and more, represents a significant shift in the way we utilize and interact with artificial intelligence (AI) / machine learning (ML). These technologies have the power to create new content, which can range from written text to images, speech, and even music.
Imagine if you had an assistant who could not only answer your questions about a vast array of topics but also write essays, create graphics, and even compose music if you asked. This is the kind of potential we’re talking about with generative AI. Now, let’s break it down a bit more.
Large Language Models (LLMs): These models, such as GPT-4 from OpenAI, are trained on a massive amount of text data. They learn to predict the next word in a sentence, which, simple as it sounds, enables them to generate coherent and contextually relevant sentences. They’re like a much, much smarter version of your phone’s predictive text. But instead of just suggesting the next word, they can write a whole essay, story, or technical report. LLMs are extremely versatile and can be applied to a multitude of tasks that traditionally required specialized models, like translation, summarization, or answering questions. They can even generate Python code!
Text-to-Image Models: These generative models, when given a textual description, can create a completely new image that matches the description. For instance, if you tell such a model to generate an image of a “red double-decker bus on a sunny day”, it will create a new, original image that fits your description. It’s almost like having a digital artist at your command!
Text-to-Speech Models: These models can convert written text into spoken words, and they’re the technology behind the voice of virtual assistants like Siri or Alexa. The quality of these models has improved to the point where it can be hard to distinguish the generated speech from a real human voice.
Now, how does this all tie in with the concept of AI model serving and changing the way AI/ML is done?
LLMs can act like a “router” between the user or an API (Application Programming Interface) and an ensemble of models. Here’s what that means: Imagine you’re running a company that has various AI models doing different tasks – one model recommends movies, another translates text, another answers customer support questions, and so on. Traditionally, you would need to write separate code to handle the input and output for each of these models, and figure out which model to send a request to based on what the user is asking for.
With an LLM acting as a “router”, you can simplify this process. The LLM can understand the user’s request (because it’s really good at understanding and generating text), figure out which model is best suited to handle the request, send the request to that model, and then put the response into human-friendly language. The LLM sits at the interface between the user and all your other models, coordinating everything and making the whole system more efficient and user-friendly.
This approach changes the way AI/ML is done because instead of needing to build a new, specialized model every time you have a new task, you can use an existing LLM and, if necessary, a small, specialized model. The LLM takes care of understanding the user’s requests and generating responses, while the smaller models can be more specialized and efficient at their specific tasks.
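To make the router idea concrete, here is a minimal sketch in Python. It uses OpenAI’s chat completion API (the pre-1.0 `openai` package) to classify the user’s intent and then dispatches to a specialized model; the `recommend_movies` and `translate_text` functions are made-up stand-ins for whatever ensemble you actually run, not a real library.

```python
import openai

# Hypothetical specialized models sitting behind the LLM "router".
def recommend_movies(query: str) -> str:
    return "Recommender output for: " + query

def translate_text(query: str) -> str:
    return "Translation output for: " + query

TOOLS = {"recommend": recommend_movies, "translate": translate_text}

def route(user_request: str) -> str:
    # Ask the LLM which specialized model should handle the request.
    choice = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word, 'recommend' or "
                        "'translate', naming the best tool for the request."},
            {"role": "user", "content": user_request},
        ],
    )["choices"][0]["message"]["content"].strip().lower()
    # Dispatch to the chosen model; fall back to the recommender.
    return TOOLS.get(choice, recommend_movies)(user_request)

print(route("Suggest a movie like Blade Runner"))
```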
Training Generative AI
Using the same general principles of traditional ML model training, plus the Transformer architecture, anyone with sufficient ML engineering knowledge, vast amounts of compute/GPU power, and large amounts of really good data can create their own “GPT” model, or fine-tune an existing one.
The reason you’d want to do this is that the data in a pretrained model (GPT = Generative Pretrained Transformer) is only as current as the last time it was trained. So, if you want to use LLMs in your application while giving users accurate information, you can’t just use an open source LLM like StableLM or a publicly available API like OpenAI’s GPT-3.5-Turbo out of the box. You need to either Train, Fine-Tune, or Tame the model using your data.
Training an LLM: Training one from scratch could cost about $50 million, and I would say Cassandra would be the place to store the training data and to keep the feedback from users on which responses are good, bad, etc. This also takes many, many hours at scale.
Finetuning an LLM: Finetuning is a lot less expensive, on the order of $500 or even as little as $100. Imagine if you could fine-tune your own personal GPT-like LLM with data relevant to your project or business every week, or even every night with fresh data. We would assume you need to know ML engineering and be pretty comfortable with the process. Apparently you can now fine-tune your own model for about $30 on commodity 48GB GPU hardware in 24 hours.
Taming an LLM: Taming an LLM is much simpler. You don’t have to spend $50 million, or even $30, to get decent results. As long as you learn some good prompt engineering, have a good understanding of data engineering (ETL/ELT), and can program apps that talk to databases and APIs, you too can “train your own dragon,” so to speak. The sketch below shows the basic trick.
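Here is a minimal sketch of what “taming” looks like in practice: no training at all, just a prompt that injects your own fresh facts as context. The facts and the prompt wording are made-up placeholders; any chat-style LLM API can stand in for the call at the end.

```python
# "Taming" an LLM: inject your own up-to-date facts into the prompt
# so the model answers from your data instead of its stale training set.
facts = [
    "Order 1234 shipped on 2023-06-02 via ground freight.",
    "Order 1234 is expected to arrive on 2023-06-09.",
]

question = "When will order 1234 arrive?"

prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.

Context:
{chr(10).join(facts)}

Question: {question}
"""

# Send `prompt` to any LLM API (OpenAI, Claude, a local model, ...).
print(prompt)
```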
Building a Smart App without knowing Machine Learning
The fascinating thing about LLMs is that, since they are pretty damn good at a lot of stuff when it comes to text, they can probably do what you need without any modifications, or even without backing them up with specialized models like I mentioned before.
To set the record straight, what I mean by “infinitely smart” is that currently LLMs have a context limit: they can only remember what you send them in your prompt. ChatGPT does some of this for you by tracking your conversation history. If you were to use OpenAI’s GPT, Cohere, or Claude on your own, the API doesn’t remember what you asked it before.
The other drawback is that these LLMs are trained on public data. They have general knowledge of what is out there, but they don’t have the facts that are not on the internet. When you use the paid ChatGPT Plus platform, you get access to ChatGPT Plugins. These plugins enhance or augment the generative AI, which makes ChatGPT Plus a “Data Augmented Generative AI” or “Retrieval Augmented Generative AI.” What if you wanted to make your own augmented generative AI that is smart about your data, but didn’t want to do it on the ChatGPT platform?
Retrieval augmented AI platforms require only some data engineering knowledge to do some ETL from your source data into a database. Though you can use databases that support querying via SQL or full-text search, the best way these days is to use a vector database.
- Data Engineering / DataOps – Using basic data engineering, you can store raw data from your sources in real time or in batch, vectorize it using an embedding model, and then store the vectors in a vector database (see the sketch after this list).
- Application / Software Engineering – Using basic application / software engineering, you can use the vector database to find data similar / relevant to your user’s query and then send that to an LLM to process and give you an answer.
- LLM Ops (Optional) – If you want, and depending on your comfort level, you can keep tweaking your prompts (still not machine learning, just editing the text / structure of a prompt), or fine-tune / train your LLM using the traditional ML approaches.
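Here is a minimal end-to-end sketch of those first two steps, again hedged: the in-memory list standing in for a vector database and the one-line prompt are simplifications, but the flow (embed, store, search by similarity, stuff the hits into the prompt) is exactly the pattern the frameworks listed below automate. It assumes the pre-1.0 `openai` package and its ada-002 embedding model.

```python
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    # Turn text into a vector with an embedding model.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Data engineering: vectorize your source documents and store them.
docs = ["Cassandra 5.0 will ship vector search.",
        "Our support hours are 9am to 5pm Pacific."]
index = [(doc, embed(doc)) for doc in docs]  # stand-in for a vector DB

# 2. Application engineering: find the most relevant doc for a query
#    and hand it to the LLM as context.
query = "When is support available?"
q_vec = embed(query)
best_doc = max(index, key=lambda pair: cosine(pair[1], q_vec))[0]

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": f"Context: {best_doc}\n\nQuestion: {query}"}],
)["choices"][0]["message"]["content"]
print(answer)
```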
This general approach is now being referred to as “Retrieval Augmented Generative AI” or “Data Augmented Generative AI.” There are various approaches to doing this.
- ChatGPT Plus Plugin – Any API with a manifest explaining what it can do can be added to ChatGPT Plus. Your APIs can be in any language or framework as long as they are exposed via REST and documented with the OpenAPI spec. Here ChatGPT is the router, enhanced by either the market of available plugins or the ones you make. (More to come on this in another article, but until then check out a Cassandra-powered ChatGPT Plugin)
- LangChain – A Python (and now JavaScript) based framework that implements the “ReAct” pattern of Reasoning/Action, giving developers shortcuts for prompt management, construction & caching, loading and vectorizing data, and a framework to use the LLM as a router to other APIs and data sources (a minimal example follows this list). (More to come on this in another article)
- LlamaIndex – A Python based framework focused on giving developers shortcuts for getting data from various sources such as files, APIs, and SaaS applications, loading it into a vector database, and using that vector database to augment the LLM. There’s more to it than that, obviously, but this framework seems the most mature in how it chunks / summarizes documents so they fit within the context limits of the model you are using. (More to come on this in another article)
- Semantic-Kernel – A relatively mature Python and C# based framework open sourced by Microsoft that breaks down the process of using an LLM as a router into Functions, Skills, and Memory. It’s very cool how they let you use prompts inside “functions” which can then call other functions. Trust me, if you truly understand why that is cool, it will blow your mind. (More to come on this in another article)
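As a taste of what these frameworks save you, here is a minimal LangChain sketch of the same stuff-the-context-into-a-prompt pattern. It assumes the classic LLMChain / PromptTemplate API from the early LangChain releases this article is contemporary with; treat it as illustrative rather than canonical.

```python
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# A reusable prompt with placeholders LangChain fills in for us.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=("Answer using only this context:\n{context}\n\n"
              "Question: {question}"),
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run(context="Cassandra 5.0 will ship vector search.",
                question="What new search feature is coming to Cassandra?"))
```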
Data Augmentation / Retrieval Augmentation with Cassandra
An augmented generative AI is a type of artificial intelligence system that combines two key elements: generative models and retrieval models. The generative part refers to the AI’s ability to generate new content, like writing a paragraph or creating a piece of art. On the other hand, the retrieval part refers to the AI’s ability to pull in relevant information from a vast database or collection of data.
Now, how does a database like Apache Cassandra come into play here?
- Data Storage: AI systems, particularly generative ones, need to process vast amounts of data. Apache Cassandra, with its distributed architecture, can store large amounts of data across multiple servers, providing a scalable solution for storing the large datasets that AI models need for training.
- Data Retrieval: In a retrieval augmented AI, the system needs to pull in relevant information from its data sources quickly and accurately. Cassandra is designed to handle high-throughput, low-latency operations, making it an excellent choice for real-time data retrieval (see the sketch after this list).
- Scalability: AI models often need to scale up quickly as more data becomes available or as the demands on the system increase. Cassandra’s distributed architecture means that it can scale up easily by adding more nodes to the system. This allows for the handling of increased data loads and more complex AI tasks without a drop in performance.
- Fault Tolerance and High Availability: In real-world applications, downtime can have serious consequences. Cassandra’s design allows for data to be replicated across multiple nodes, ensuring that if one node fails, the data is still available from another node. This makes it a reliable choice for critical AI applications.
- Integration with AI Tools: Cassandra can be integrated with various AI and machine learning platforms, making it easier to use it as a back-end for AI applications. This allows data scientists and engineers to leverage the power of Cassandra’s distributed storage and retrieval capabilities directly within their AI workflows.
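To ground the storage and retrieval points, here is a minimal sketch using the cassandra-driver Python package against the hypothetical smart_app keyspace from earlier. The table layout (a chunks table keyed by document ID) is an example schema, not a prescribed one.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("smart_app")

# Store a cleaned document chunk that the AI will later retrieve.
session.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        doc_id text, chunk_id int, body text,
        PRIMARY KEY (doc_id, chunk_id)
    )
""")
session.execute(
    "INSERT INTO chunks (doc_id, chunk_id, body) VALUES (%s, %s, %s)",
    ("faq", 0, "Our support hours are 9am to 5pm Pacific."),
)

# Low-latency retrieval by key, the access pattern Cassandra excels at.
for row in session.execute(
        "SELECT body FROM chunks WHERE doc_id = %s", ("faq",)):
    print(row.body)
```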
Out of the box, Apache Cassandra’s NoSQL approach lets users get data only if they know exactly what to look for, and until recently it had only inklings of the ability to “search” for data. Soon, Cassandra 5.0 will have Vector Search!
The recently released Cassio open source project will make using Cassandra for LLM workloads easier by wrapping Cassandra CQL and the upcoming vector capabilities in a simple library. This library can be used in your own framework or in one like LangChain, LlamaIndex, or Semantic-Kernel, or even in “AGI” projects like AutoGPT and BabyAGI. A hedged sketch of what that looks like follows.
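For a flavor of where this is headed, here is a sketch using LangChain’s Cassandra vector store integration (which is built on Cassio). The exact class names and constructor arguments were still settling at the time of writing, so treat this as an approximation of the API rather than a reference.

```python
from cassandra.cluster import Cluster
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra

session = Cluster(["127.0.0.1"]).connect()

# Store document vectors in Cassandra, then search by similarity.
store = Cassandra(
    embedding=OpenAIEmbeddings(),
    session=session,
    keyspace="smart_app",
    table_name="doc_vectors",
)
store.add_texts(["Cassandra 5.0 will ship vector search."])
hits = store.similarity_search("What's new in Cassandra?", k=1)
print(hits[0].page_content)
```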
This is a HUGE deal. It means that your multi-terabyte or multi-petabyte database running on Cassandra can power an infinitely smart AI, because your AI system is now backed with relevant data that is fresh and accurate. As the data accumulates, there’s no need to shard your data or indexes.
In conclusion, Apache Cassandra’s features such as scalability, fast data retrieval, and high fault tolerance make it a suitable choice for powering a Retrieval Augmented or Data Augmented Generative AI platform. The ability to store and access vast amounts of data quickly and accurately can greatly enhance the performance of these AI systems. Upcoming articles in this series will continue to expand on these topics as they relate to Cassandra and LLMs.
If you are interested in learning more about how Cassandra can be used in the LLM world, register for the upcoming event previewing Cassandra’s Vector Search and the in-person bootcamp in Santa Clara. See below.
Upcoming Events
Resources
- Cassio.org – A Python based framework that simplifies using Cassandra’s core CQL and upcoming vector capabilities for LLM / generative AI. In addition to the “cassio” Python library, it will soon have integrations in LangChain, LlamaIndex, and beyond.
- NoCode Data & AI Meetup – A virtual meetup focused on using LLMs in the easiest way possible, and ideally without code. See past videos on YouTube.
- Kono “Robot Assistants and Experts” Blog – Curated links, latest videos, and blog posts. Also fill out the survey if you are interested in future in-person bootcamps.