Ever since AI agents began showing promise, organizations have had to wrestle with whether a single agent is sufficient, or whether they should invest in building a broader network of multiple agents that touches more points in their organization.
Orchestration framework company LangChain set out to answer that question. It put a single agent through a series of experiments and found that individual agents have a limit on context and tools before their performance begins to degrade. These experiments could lead to a better understanding of the architecture needed to maintain agents and multi-agent systems.
In a blog post, LangChain detailed a set of experiments it ran with a single ReAct agent and compared its performance. The main question LangChain hoped to answer was: “At what point does a single ReAct agent become overloaded with instructions and tools, and when do you see performance drop?”
LangChain chose to use the ReAct agent framework because it is “one of the most basic agent architectures.”
While benchmarking agent performance can often lead to misleading results, LangChain chose to limit the test to two easily quantifiable agent tasks: answering questions and scheduling meetings.
“There are many existing benchmarks for tool use and tool calling, but for the purposes of this experiment, we wanted to evaluate a practical agent that we actually use,” LangChain wrote. “This agent is our internal email assistant, responsible for two main domains of work: responding to and scheduling meeting requests, and supporting customers with their questions.”
LangChain mainly used prebuilt ReAct agents through its LangGraph platform. These agents featured tool-calling large language models (LLMs) that became part of the benchmark test. The LLMs included Anthropic’s Claude 3.5 Sonnet, Meta’s Llama-3.3-70B and a trio of OpenAI models: GPT-4o, o1 and o3-mini.
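For readers unfamiliar with that setup, the sketch below shows roughly what a prebuilt ReAct agent built on LangGraph looks like. It is a minimal illustration under our own assumptions: the tools, model choice and user message are placeholders, not the actual code behind LangChain’s email assistant.

```python
# Minimal sketch of a LangGraph prebuilt ReAct agent (illustrative only; the
# tools and model name below are placeholders, not LangChain's benchmark code).
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent

@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email reply to a customer."""
    return f"Email sent to {to}"

@tool
def schedule_meeting(attendees: str, time: str) -> str:
    """Schedule a meeting at the requested time."""
    return f"Meeting scheduled for {time}"

# A tool-calling LLM; any of the benchmarked models could be swapped in here.
llm = ChatAnthropic(model="claude-3-5-sonnet-latest")

# create_react_agent wires the model and tools into a ReAct-style loop:
# the model reasons, picks a tool, observes the result, and repeats.
agent = create_react_agent(llm, tools=[send_email, schedule_meeting])

result = agent.invoke(
    {"messages": [("user", "Can we meet Tuesday at 10am about my invoice?")]}
)
```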
The company broke the tests down to better evaluate the email assistant’s performance on both tasks, creating a list of steps to follow. It began with the email assistant’s customer support capabilities, looking at how the agent accepts an email from a customer and responds with an answer.
LangChain first evaluated the tool-calling trajectory, or the tools the agent taps. If the agent followed the correct order, it passed the test. Next, the researchers asked the assistant to respond to an email and used an LLM to judge its performance.
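As a rough illustration of those two checks, the sketch below pairs an exact-match comparison of the tool-calling trajectory with a simple LLM-as-judge grader. The function names, prompt wording and pass/fail format are assumptions for illustration, not LangChain’s published evaluation code.

```python
# Illustrative evaluation helpers (assumed structure, not LangChain's actual code).
from langchain_openai import ChatOpenAI

def trajectory_passes(actual_tool_calls: list[str], expected_tool_calls: list[str]) -> bool:
    # The agent passes only if it called the expected tools in the expected order.
    return actual_tool_calls == expected_tool_calls

def judge_email_response(customer_email: str, assistant_reply: str) -> bool:
    # A separate LLM grades whether the drafted reply actually answers the customer.
    judge = ChatOpenAI(model="gpt-4o")
    verdict = judge.invoke(
        "You are grading an email assistant.\n"
        f"Customer email: {customer_email}\n"
        f"Assistant reply: {assistant_reply}\n"
        "Reply with PASS if the response answers the question, otherwise FAIL."
    )
    return "PASS" in verdict.content.upper()
```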
For the second work domain, calendar scheduling, LangChain focused on the agent’s ability to follow instructions.
“In other words, the agent needs to remember specific instructions provided, such as exactly when it should schedule meetings with different parties,” the researchers wrote.
Once the parameters were defined, LangChain set about stressing and overwhelming the email assistant agent.
It set 30 tasks each for calendar scheduling and customer support. These were run three times (for a total of 90 runs). The researchers created a calendar scheduling agent and a customer support agent to better evaluate the tasks.
“The calendar scheduling agent only has access to the calendar scheduling domain, and the customer support agent only has access to the customer support domain,” LangChain explained.
The researchers then added more tasks and domain tools to the agents to increase their responsibilities. These ranged from human resources to technical quality assurance to legal and compliance, along with a number of other areas.
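One way to picture that domain-overload setup is sketched below: each domain is treated as a bundle of tools, a baseline agent gets only its own domain, and overloaded agents get additional, mostly irrelevant domains stacked on top. The domain names and tools are hypothetical stand-ins, not the ones LangChain actually used.

```python
# Hypothetical domain-overload setup (the domains and tools are stand-ins).
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def schedule_meeting(attendees: str, time: str) -> str:
    """Schedule a meeting at the requested time."""
    return f"Scheduled for {time}"

@tool
def send_email(to: str, body: str) -> str:
    """Send an email reply to a customer."""
    return f"Sent to {to}"

@tool
def file_hr_ticket(summary: str) -> str:
    """File a ticket with human resources."""
    return "HR ticket filed"

# Each "domain" is just a named bundle of tools.
DOMAIN_TOOLS = {
    "calendar": [schedule_meeting],
    "customer_support": [send_email],
    "hr": [file_hr_ticket],  # an extra, mostly irrelevant domain
}

def build_agent(model_name: str, domains: list[str]):
    # Give the agent the union of tools from every domain it must cover.
    tools = [t for domain in domains for t in DOMAIN_TOOLS[domain]]
    return create_react_agent(ChatOpenAI(model=model_name), tools=tools)

baseline = build_agent("gpt-4o", ["calendar"])            # single-domain agent
overloaded = build_agent("gpt-4o", list(DOMAIN_TOOLS))    # every domain stacked on
```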
After running the evaluations, LangChain found that single agents often became overwhelmed when told to do too many things. They began forgetting to call tools or failed to respond to tasks when given more instructions and context.
LangChain found that calendar scheduling agents using GPT-4o “performed worse than Claude-3.5-sonnet, o1 and o3-mini across the various context sizes, and performance dropped off more sharply than the other models when larger context was provided.” The performance of GPT-4o calendar schedulers fell to 2% when the domains increased to at least seven.
Other models didn’t fare much better. Llama-3.3-70B forgot to call the send_email tool, “so it failed every test case.”
Only Claude-3.5-Sonnet, o1 and o3-mini remembered to call the tool, but Claude-3.5-Sonnet performed worse than the two OpenAI models. However, o3-mini’s performance degraded once irrelevant domains were added to the scheduling instructions.
The customer support agent can call on more tools, but for this test, LangChain said Claude-3.5-Sonnet performed about as well as o3-mini and o1. It also showed a shallower performance drop when more domains were added. However, when the context window was extended further, the Claude model performed worse.
GPT-4o again performed the worst among the models tested.
“We saw that as more context was provided, instruction following became worse. Some of our tasks were designed to follow niche, specific instructions (for example, do not perform a certain action for EU-based customers),” LangChain said. “We found that these instructions were successfully followed by agents with fewer domains, but as the number of domains increased, these instructions were more frequently forgotten and the tasks subsequently failed.”
The company said it is exploring how to evaluate multi-agent architectures using the same domain-overload method.
LangChain is already invested in agent performance, having introduced the concept of “ambient agents,” or agents that run in the background and are triggered by specific events. These experiments could make it easier to figure out the best way to ensure agent performance.