May 17, 2024
As we mentioned in a previous note, RAG (Retrieval-Augmented Generation) systems are used to search for information in a database (almost always plain text) and pass it as context to an LLM (Large Language Model) to extend its capabilities, whether by adding knowledge or keeping it up to date. There are many challenges and decisions involved in building a RAG system, which we will discuss below.
File Processing
Since the end goal is plain text, special consideration must be given to the technique used to transform other file types. In the case of video, some libraries allow processing the video directly, while others require extracting the audio track first.
When processing audio, there are several models available. All the major companies offer one (OpenAI, Meta, Google, Amazon, Microsoft, IBM), but there are also companies specialized in this field (Deepgram, Assembly, Rev AI). Across this variety we find different models, accuracies, costs, and processing times. Many include the option of adding a custom vocabulary (a whitelist of words) to help the model recognize them, which is very useful for domain-specific terms or proper names.
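As a minimal sketch of this step, assuming ffmpeg is installed and an OpenAI API key is available (the file names and the vocabulary hint are hypothetical), the audio could be extracted and then transcribed like this:

```python
import subprocess
from openai import OpenAI

# Extract the audio track from the video with ffmpeg (hypothetical file names).
subprocess.run(
    ["ffmpeg", "-i", "episode.mp4", "-vn", "-ar", "16000", "-ac", "1", "episode.wav"],
    check=True,
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The `prompt` parameter lets us pass specific terms or proper names
# so the model is more likely to spell them correctly.
with open("episode.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="RAG, HNSW, Deepgram, AssemblyAI",
    )

print(transcript.text)
```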
Regarding PDF processing, there are countless libraries that handle it reasonably well out of the box. If we are working with scanned files, it may be worth looking for a higher-quality tool with OCR mechanisms or image-enhancement features.
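A minimal sketch of plain-text extraction from a digital (non-scanned) PDF, using the pypdf library as one of many possible choices (the file name is hypothetical):

```python
from pypdf import PdfReader

reader = PdfReader("report.pdf")

# Concatenate the text of every page; a scanned PDF would return little or
# nothing here and would need an OCR step instead.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```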
Lastly, regarding tables, there are two main approaches to handling this kind of source. One involves converting all the table information to plain text with structural markers (for example, in Markdown, JSON, or XML); the other would involve developing an agent capable of transforming the user's question into an SQL query and executing it on the relevant database. In the latter case, however, we are talking about a much more complex design than what we have understood by RAG so far.
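For the first approach, a tabular source can be flattened into marked-up plain text before chunking. A sketch using pandas (the file and its columns are hypothetical; `DataFrame.to_markdown` requires the tabulate package):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical tabular source

# Render the table as Markdown so the row/column structure survives as plain
# text that can be chunked and embedded like any other document.
markdown_table = df.to_markdown(index=False)
print(markdown_table)
```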
Chunk and Overlap
Once we have the plain text, it's time to divide it into fragments. To do this, we need to define the size of the chunks (how many words or tokens each will cover). If the size is too small, the chunks will not carry enough context to be representative; if it is too large, each chunk becomes too vague for precise retrieval. These decisions will also impact both the latency and the cost of the system. On the other hand, we can also play with the overlap: having part of the tokens repeat between one chunk and the next. This way, the divisions are less abrupt, and we can check whether the same words, surrounded by different context (in different chunks), turn out to be more relevant.
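A minimal word-based sketch of fixed-size chunking with overlap (the sizes are arbitrary and would need tuning for a real corpus):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `chunk_size` words, repeating `overlap` words
    between consecutive chunks so the divisions are less abrupt."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_words("long plain text extracted in the previous step ...")
```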
To decide the chunk size and overlap, we can use different approaches. One option is to define a fixed number of tokens or a desired number of chunks. However, the semantic boundaries of a text do not always fall at fixed intervals. Taking this into account, we can use character-based divisions (for example, line or paragraph breaks), sentence-based divisions (the same idea, but splitting on periods), or meaning-based divisions (semantically analyzing each sentence), or even a recursive combination of all the previous strategies. We can also vary the strategy according to the type of the original source file. For example, if we have a video of a podcast with two speakers, it would make sense for each chunk to be a fragment from a single speaker.
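Libraries such as LangChain implement the recursive combination mentioned above, falling back from paragraph breaks to sentences to raw characters. A sketch, assuming the langchain-text-splitters package is installed (the sizes and separators are just illustrative defaults):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # measured in characters by default
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # paragraph, line, sentence, word
)
chunks = splitter.split_text(text)
```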
Embeddings
Once we have decided on the best chunking and overlap strategy, it's time to convert the words into numerical vectors. For this, we need to choose an encoder that turns text into embeddings. We can use an open-source or a proprietary model; we can use one from the same provider as the LLM we will later build on, or a different one; and we can use generic encoders or ones fine-tuned for a specific task. As we have seen in other cases, beyond the accuracy of the encoder, we must consider its dimensionality because of cost and latency.
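A sketch of both options: an open-source encoder via sentence-transformers and a proprietary one via the OpenAI API (the model names are common examples, not a recommendation):

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

chunks = ["first chunk of text", "second chunk of text"]

# Open-source encoder: runs locally and produces 384-dimensional vectors.
local_model = SentenceTransformer("all-MiniLM-L6-v2")
local_embeddings = local_model.encode(chunks)

# Proprietary encoder: different dimensionality, and therefore different
# storage cost and query latency downstream.
client = OpenAI()
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
api_embeddings = [item.embedding for item in response.data]
```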
Vector Database
Now that we have all our chunks (with overlap) converted into numerical vectors, we need a database to store them. As with any database choice, there are several factors to consider: for example, the cost and latency of inserting data and, even more importantly, of querying it; or whether it is an in-memory database (like Redis) or a disk-based one, with the corresponding trade-offs (higher speed at the cost of lower storage capacity, and vice versa).
On the other hand, when storing the embeddings we will also want to store some metadata about each fragment (which section it belongs to, who was speaking, what language it is written in) that will help us filter when performing a search. In that case, we should define whether the filtering happens before the similarity search (pre-filtering), after it (post-filtering), or with a hybrid approach.
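As an illustration of filtering on metadata before the similarity search, a sketch with Chroma (the collection name, metadata fields, and values are hypothetical):

```python
import chromadb

client = chromadb.Client()  # in-memory; persistent clients are also available
collection = client.create_collection("podcast_chunks")

collection.add(
    ids=["c1", "c2"],
    documents=["chunk about pricing", "chunk about onboarding"],
    metadatas=[{"speaker": "host", "language": "en"},
               {"speaker": "guest", "language": "en"}],
)

# Restrict by metadata first, then run the similarity search on what remains.
results = collection.query(
    query_texts=["how much does it cost?"],
    n_results=1,
    where={"speaker": "host"},
)
```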
Lastly, one of the most discussed points in RAG system design is the indexing mechanism for the embeddings: flat (exhaustive) indexes and approximate nearest neighbor (ANN) structures such as HNSW or PQ, along with their variants, are among the most used. These indexes use different strategies to organize the vectors according to their proximity in space, so that subsequent search and retrieval are faster. As part of the system design, we must also consider how new documents will be indexed (something we assume will happen quite frequently in production).
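A sketch comparing a flat (exhaustive) index with an HNSW one in FAISS, assuming the embeddings are already in a NumPy array (random vectors stand in for real ones here):

```python
import faiss
import numpy as np

dim = 384
embeddings = np.random.rand(10_000, dim).astype("float32")  # placeholder vectors

# Exhaustive search: exact results, but query time grows linearly with the data.
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(embeddings)

# HNSW graph: approximate results, much faster queries on large collections.
hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # 32 neighbors per graph node
hnsw_index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw_index.search(query, 5)  # top 5 nearest vectors
```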
User Query
So far, we have overlooked the initial step or trigger of this whole process: the user's query. Part of the RAG system's effectiveness depends on how the query is formulated, since that determines which similar fragments will be found in the vector database. However, user queries are often not written in a way that facilitates similarity search. Therefore, there are techniques to improve search results. Some of them involve using an LLM to paraphrase the original query, or to split it into sub-questions whose results are later merged back together.
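A sketch of query rewriting with an LLM before the similarity search (the model choice and prompt wording are only illustrative):

```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(user_query: str) -> str:
    """Ask the LLM to paraphrase the query into a form better suited to
    similarity search (explicit, concise, free of filler words)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question as a short, explicit "
                        "search query for a document retrieval system."},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

search_query = rewrite_query("hey, wasn't there something about overlap sizes?")
```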
Retrieval
Beyond the fact that the index we use for the vector database is usually optimized for a certain similarity metric (such as Euclidean distance, Manhattan distance, or cosine similarity), there are techniques to improve the quality of the documents retrieved by the RAG system. Many of them involve reindexing the original documents: in some cases by expanding or shrinking the chunk size and incorporating more context; in others by generating fictitious text similar to the original to increase the density of the search space (a form of data augmentation); and in others by building indexing layers with summaries of sections composed of several chunks (or clusters of chunks), so that the search happens in several consecutive stages.
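To make the similarity metrics concrete, a NumPy sketch that compares a query vector against the stored chunk embeddings using the three distances mentioned above:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 3, metric: str = "cosine"):
    """Return the indices of the k closest vectors under the chosen metric."""
    if metric == "cosine":
        scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
        return np.argsort(-scores)[:k]   # higher similarity is better
    if metric == "euclidean":
        dists = np.linalg.norm(vectors - query, axis=1)
    elif metric == "manhattan":
        dists = np.abs(vectors - query).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(dists)[:k]         # lower distance is better
```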
Context Size and Consolidation
Another relevant point is how many documents we return from the similarity search. If the number is too small, we may miss relevant information; if it is too large, the context becomes diluted and the LLM may get lost interpreting it. Additionally, once retrieval has returned its documents, there is a process of interpreting and consolidating these different fragments. In the best case they will be consistent and complementary, but if they are not, we must refine the mechanism that deals with possible inconsistencies.
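Once the top-k chunks are retrieved, the consolidation step can be as simple as concatenating them into the prompt, though more elaborate strategies exist (deduplication, reordering, conflict resolution). A minimal sketch of that simple case:

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the retrieved fragments into a single context block for the LLM."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the fragments contradict each other, say so explicitly.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```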
The Future of RAG
It is clear that RAG systems have evolved rapidly since the appearance of LLMs. Much of this can be explained by the need to complement the models with new or private information. There is a debate about how long this mechanism will remain in use as context windows expand and access to tools like web search increases. However, RAG systems, as mechanisms for finding relevant information in text databases, will likely find other functions. Ultimately, what we are doing is manipulating text and its vector representations.