May 17, 2024
Large language models (LLMs) can be much more than a chatbot or a conversational interface. With the technology available as of May 2024, they can already automate or improve many tasks. However, achieving those results still requires solving several technical challenges. If we consider the simplest version of an LLM (a conversational interface like ChatGPT), it lacks memory, often hallucinates, and cannot perform any action beyond communicating through language.
Agents
To overcome these barriers, various tools and paradigms have been developed (and are still being developed). In our opinion, one of the most promising is multi-agent systems. But what is an agent? We could say it is an autonomous entity capable of interactions richer than conversation, able to automate tasks and to do so with results equal to or better than a human's. Building something like this requires various components, which we discuss below.
RAG
Retrieval-Augmented Generation (RAG) systems were one of the first enhancements to LLMs. Their original application was related to the fact that the models were trained with information up to a certain date, so to have more up-to-date data, it had to be explicitly provided. Another common use case for RAG is to provide the LLM with access to private information, whether from a person or a company, which was also not part of the model’s training data.
RAGs consist of various techniques for processing, indexing, and facilitating text searches from a database. As we saw, it is a way to provide the LLM with access to more information than it possesses from its own training. Basically, what we do is search the database for relevant information for each interaction and pass it to the model as context through different mechanisms. If you are interested in knowing more, in this article we have an overview of the various components of a RAG system, and in this other one, we delve deeper into some of the techniques that exist to develop them.
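The core loop described above can be sketched in a few lines. This is a deliberately minimal illustration, not a production design: a real system would score relevance with vector embeddings and a vector index, while here we use plain word overlap so the example stays self-contained. All document contents and function names are made up for the example.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split into words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents,
                    key=lambda d: len(query_words & tokenize(d)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Inject the retrieved passages into the LLM prompt as context."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on national holidays.",
    "Shipping takes 3 to 5 business days within the country.",
]
print(build_prompt("What is your refund policy?", docs))
```

The prompt that reaches the model now carries the relevant passages, which is the whole trick: the LLM itself is unchanged.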
However, even with a good implementation, a RAG is nothing more than a way to add information not present in the training without qualitatively increasing the LLM's capabilities.
Tools
At some point (earlier at OpenAI, more recently elsewhere), LLMs were given the ability to call external functions. These can be anything from Python code to external API calls. Think of something as simple as knowing the current date and time: that information was not part of the training data, so without an external tool the model could never answer correctly.
Adding external functions is possible thanks to two strengths of these models. On one hand, they naturally work well with JSON, making communication to and from the LLM very simple. On the other hand, LLMs are also quite good at understanding user intentions. This means that if we precisely define the objective of a function in natural language, these models can usually interpret when to use it and transform our request into JSON to call an API or function.
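The flow can be sketched as follows. The tool is described in natural language plus a parameter schema, the model replies with a JSON object naming the function and its arguments, and our code executes it. The model is stubbed with a keyword check, and the schema layout is illustrative rather than any specific vendor's API:

```python
import json
from datetime import datetime

TOOLS = {
    "get_current_time": {
        "description": "Return the current date and time.",
        "parameters": {"timezone": "IANA name, e.g. 'Europe/Madrid'"},
        # For simplicity this sketch ignores the timezone argument.
        "function": lambda timezone: datetime.now().isoformat(),
    }
}

def fake_llm(user_message: str) -> str:
    """Stand-in for the model: maps the user's intent to a tool call."""
    if "time" in user_message.lower():
        return json.dumps({
            "tool": "get_current_time",
            "arguments": {"timezone": "Europe/Madrid"},
        })
    return json.dumps({"tool": None, "answer": "I can answer that directly."})

def dispatch(llm_output: str) -> str:
    """Parse the model's JSON and run the requested tool, if any."""
    call = json.loads(llm_output)
    if call.get("tool") in TOOLS:
        return TOOLS[call["tool"]]["function"](**call["arguments"])
    return call["answer"]

print(dispatch(fake_llm("What time is it?")))  # current ISO timestamp
```

In a real system, the intent-to-JSON step is where the LLM does the work; everything around it is ordinary plumbing.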
However, one scenario is having an LLM with one or two auxiliary functions (e.g., a web search engine and a code interpreter), and another is designing a system with a variety of functions and API calls. Designing systems of this kind is much more complex, and this is where multi-agent approaches come into play, which we will discuss later.
Memory
Another challenge that arose with the expansion of LLM usage has to do with memory. We want the model to remember our preferences or important details from previous conversations without having to repeat them at every step.
Memory, attention, and the context window are closely related concepts. If an LLM had an infinite context window (as Google was already proposing as of May 2024), this would imply two things. On one hand, we could pass the entirety of the user's conversations, interactions, and preferences with each message. On the other, it would mean that attention mechanisms had reached a level of precision where, within that (almost) infinite mass of text, the LLM would know what to attend to in each case and for each interaction.
Without reaching this scenario, there are various implementations that try to simulate a memory for the LLM. The simplest consists of the user explicitly stating which elements of the conversation they want the model to remember. A step further is establishing mechanisms for the LLM itself to summarize parts of the conversation, distilling topics, and storing that in memory, or trying to remember details it finds relevant. This way, we would build a database of user information available at all times as context for each interaction.
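A toy version of that distill-and-recall cycle looks like the sketch below. Both steps would be LLM calls plus a RAG-style index in practice; here the distillation is a trivial heuristic and recall is word overlap, and the class and phrase lists are invented for the example:

```python
class Memory:
    """Toy long-term memory: distill facts, retrieve them later."""

    def __init__(self):
        self.facts: list[str] = []

    def distill(self, user_message: str) -> None:
        """Store statements that look like stable preferences.
        (A real system would ask an auxiliary LLM to decide this.)"""
        if user_message.lower().startswith(("i prefer", "i like", "my name is")):
            self.facts.append(user_message)

    def recall(self, query: str) -> list[str]:
        """Return stored facts sharing words with the current query."""
        words = set(query.lower().split())
        return [f for f in self.facts if words & set(f.lower().split())]

memory = Memory()
memory.distill("I prefer answers in Spanish")
memory.distill("What's the weather today?")   # not worth remembering
print(memory.recall("Reply about Spanish grammar"))
```

The recalled facts would then be injected into the context of each interaction, exactly as retrieved passages are in a RAG.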
These techniques are not perfect, and as this database grows and becomes more sophisticated, they require auxiliary functions and RAG-like mechanisms. In turn, other challenges appear, such as the temporal dimension: what happens if something recorded in memory as a preference changes over time? As of this writing (May 2024), OpenAI has just announced memory for the paid, GPT-4 version of ChatGPT, and the model with the largest context window offers 1 million tokens (Gemini 1.5 Pro).
Planning
However, one of the tools most closely linked to the idea of agents is planning. By planning, we mean all paradigms that attempt to introduce models or algorithms of thought or reasoning to LLMs.
The first cases of planning paradigms appeared early on, with techniques derived from prompt engineering. One of the most famous consisted of asking an LLM to think step by step before responding. The mere introduction of this instruction in the prompt significantly improved the model's performance.
The formalization of this mechanism led to what was called chains of thought. Following this, a series of consecutive paradigms appeared, such as trees of thought, graphs of thought, algorithms of thought, etc. In all cases, the process consists of subdividing the original task into smaller tasks and exploring different paths to reach a solution.
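To show how small the original trick is, compare a plain prompt with a chain-of-thought prompt: the only difference is the added instruction to decompose the problem before answering. The wording and the example question are illustrative:

```python
def plain_prompt(question: str) -> str:
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # The decomposition instruction is the entire "technique".
    return (
        f"Question: {question}\n"
        "Think step by step: break the problem into smaller parts, "
        "solve each one, and only then give the final answer.\n"
        "Reasoning:"
    )

print(cot_prompt("If a train travels 60 km in 45 minutes, "
                 "what is its speed in km/h?"))
```

Trees and graphs of thought generalize this by generating several such decompositions and exploring them as branches rather than a single chain.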
To make this possible, another parallel branch had to be developed, related to the models' self-reflection: that is, the ability of an LLM to observe the result it produced and be critical of it. The first paradigm that emerged for this (still quite relevant as of May 2024) was ReAct, short for Reasoning and Acting. It consists of instructing the model to divide a task into parts, execute each one, and immediately evaluate whether the result meets the initial objective; if not, the task is subdivided again and re-executed using the feedback from the previous step.
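The reason-act-observe loop can be sketched as below. The "model" is a scripted stub standing in for real LLM calls, and every tool and step name is an assumption made for the example:

```python
def react_loop(goal: str, model, tools: dict, max_steps: int = 5) -> str:
    """Alternate reasoning and acting until the model decides to finish."""
    observations = []
    for _ in range(max_steps):
        # Reason: decide the next action given the goal and what we've seen.
        thought, action, arg = model(goal, observations)
        if action == "finish":
            return arg
        # Act: run the chosen tool, then feed the observation back in.
        observations.append(tools[action](arg))
    return "gave up"

def scripted_model(goal, observations):
    """Stub: first look something up, then declare the goal met."""
    if not observations:
        return ("I need the raw number first", "lookup", "population of X")
    return ("I have enough to answer", "finish",
            f"Answer based on {observations[-1]}")

tools = {"lookup": lambda q: f"result for '{q}'"}
print(react_loop("question about X", scripted_model, tools))
```

The self-reflection lives in the model's second call: it sees its own earlier observation and judges whether the objective has been met before answering.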
With the introduction of these planning techniques, we begin to talk about agents. However, for these agents to achieve greater degrees of autonomy, we not only have to combine all the previous techniques, but new challenges also appear. This is where multi-agent approaches come into play.
Multi-agent
Starting from the previous scheme, we can think of having an LLM equipped with tools (calls to functions or connections to external APIs), memory (which can be short-term to remember the most relevant aspects of a particular interaction or long-term RAG-like to connect with the history of conversations and preferences), and planning paradigms (with task subdivision and self-reflection on the results). By adding all these parts, we can build agents with a considerably high degree of autonomy.
However, as we complicate the tasks we ask the agent to perform, we see some problems arise:
- Tool selection: it is no longer clear when to use each tool, and function calls may overlap. This can force us to define dependencies between functions or to set up parallelization mechanisms to keep response times acceptable.
- Memory growth: what is stored in memory becomes increasingly large, and some tasks or function calls need certain elements from it while others do not. This can cause confusion or inefficiency when searching that memory for what is relevant at each step.
- A single planning mechanism: we have defined one planning paradigm for all tasks, but some tasks work better under certain paradigms, which results in a loss of efficiency.
Considering this, the multi-agent approach emerges. In this case, we define an agent for each task we need to accomplish. Each agent will have its own tools, memory, and planning paradigm. Some will be more complex, others simpler. Some will have more advanced models, others simpler models but with better RAG mechanisms. We can think of the interaction with the end-user as another task, with its specific agent that not only interacts with the user and translates requests but also has the ability to verify the results of the other agents and be critical of them.
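This division of labor can be sketched as follows. A keyword check stands in for what would in practice be an LLM-based router, and all agent names and outputs are invented for the illustration:

```python
def research_agent(task: str) -> str:
    """Specialized agent: would have its own tools, memory, planning."""
    return f"[research] summary for: {task}"

def coding_agent(task: str) -> str:
    """Another specialist, possibly a different model entirely."""
    return f"[code] snippet for: {task}"

def critic_agent(result: str) -> bool:
    """Self-reflection step: accept only non-empty, labeled results."""
    return result.startswith("[") and len(result) > 10

def router(task: str) -> str:
    """Pick a specialist, run it, and let the critic gate the output."""
    agent = coding_agent if "code" in task.lower() else research_agent
    result = agent(task)
    return result if critic_agent(result) else "escalate to a human"

print(router("write code to parse a CSV"))
print(router("find sources on agent planning"))
```

The point of the pattern is that each specialist keeps its own small toolbox, memory, and planning paradigm, so none of the three problems above has to be solved globally.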
The future
In this article, we reviewed the various components that have been added to models to advance towards autonomous entities based on LLMs that can automate tasks with a level of efficiency equal to or greater than that of a human. There are still many challenges to overcome, but the speed at which progress is being made, creating new tools and overcoming obstacles, is very high.
Alongside the technical challenges is the need to lower both the input and output costs of the models and the latency at which they respond. This process is also advancing rapidly, whether through hardware development or optimization of the different stages of the models.
The complexity of developing agents or multi-agents, coupled with their cost and inaccuracy, means that there are still very few use cases in the industry where this is profitable. However, if this curve of technological progress and cost reduction continues, we can expect this to change rapidly in the coming months.