Building autonomous agents with tool use, planning, and multi-agent systems
AI agents represent the next frontier in large language model applications. Unlike chatbots that respond to a single prompt and terminate, agents can pursue multi-step goals, use tools, remember previous actions, and collaborate with other agents. They transform LLMs from interactive interfaces into autonomous systems that can take actions in the world.
This article covers agent architecture, the key frameworks enabling tool use, prompting strategies like ReAct, and the emerging landscape of multi-agent systems.
An AI agent differs from a simple LLM wrapper in several key capabilities: multi-step goal pursuit, tool use, memory of previous actions, and collaboration with other agents. These capabilities come together in the agent's core execution loop:
AGENT EXECUTION LOOP

┌─────────────────────────────────────────┐
│ 1. PERCEIVE                             │
│   • Observe environment state           │
│   • Process user input                  │
│   • Check memory for context            │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│ 2. REASON                               │
│   • Evaluate current state vs. goal     │
│   • Consider available actions/tools    │
│   • Select next action                  │
└─────────────────┬───────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│ 3. ACT                                  │
│   • Execute selected action             │
│   • Update memory with result           │
│   • Evaluate if goal achieved           │
└─────────────────┬───────────────────────┘
                  ↓
Loop until goal achieved or max iterations
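The loop above can be sketched in a few lines of Python. All names here are illustrative, not from any framework; `policy` stands in for the LLM's combined perceive-and-reason step:

```python
def run_agent(goal, policy, max_iterations=10):
    """Minimal perceive-reason-act loop.

    `policy(goal, memory)` stands in for the LLM: it returns either
    ("act", tool_fn) to take another step or ("done", answer) to stop.
    """
    memory = []
    for _ in range(max_iterations):
        kind, payload = policy(goal, memory)  # perceive + reason
        if kind == "done":
            return payload                    # goal achieved
        result = payload()                    # act: run the chosen tool
        memory.append(result)                 # update memory with the result
    return None  # max iterations exceeded without reaching the goal


# Toy policy for illustration: accumulate progress until the goal is met.
def adder_policy(goal, memory):
    total = sum(memory)
    if total >= goal:
        return ("done", total)
    return ("act", lambda: 1)  # "tool" that yields one unit of progress
```

The `max_iterations` cap matters in practice: without it, an agent that never judges its goal achieved will loop forever.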
ReAct (Reason + Act), introduced by Yao et al. (2022), provides a structured prompting approach that interleaves reasoning traces with actions. The key insight: making the model's reasoning explicit before taking actions leads to better task completion.
ReAct Prompt Structure:
Question: The user query
Thought: [Model's reasoning about what to do]
Action: [The action to take - function call]
Observation: [Result of the action]
... (repeat Thought/Action/Observation as needed)
Thought: I now know the final answer
Answer: [Final response]
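A minimal driver for this format might look like the following sketch. The regexes, the bracketed `Action: tool[input]` syntax, and the scripted `fake_llm` are illustrative assumptions, not a specific library's API:

```python
import re

def react_loop(question, llm, tools, max_steps=5):
    """Drive a ReAct loop: the model emits Thought/Action lines, we execute
    the named tool and append an Observation, until an Answer appears.
    `llm(transcript)` is a stand-in for a real model call."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        answer = re.search(r"Answer:\s*(.+)", step)
        if answer:
            return answer.group(1)
        action = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if action:
            name, arg = action.groups()
            transcript += f"Observation: {tools[name](arg)}\n"
    return None  # gave up after max_steps


# Scripted "model" for illustration: look up a fact, then answer.
def fake_llm(transcript):
    if "Observation" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[Tokyo population]"
    return "Thought: I now know the final answer.\nAnswer: about 14 million"

tools = {"lookup": lambda q: "Tokyo has roughly 14 million residents"}
```

The transcript grows with each Thought/Action/Observation triple, so the model always sees its full history when deciding the next step.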
On multi-step tasks, ReAct outperforms both pure reasoning (chain-of-thought without actions) and pure action (acting without explicit reasoning):
| Benchmark | Chain-of-Thought | Action-Only | ReAct |
|---|---|---|---|
| HotpotQA (multi-hop QA) | 46.8% | 41.8% | 54.4% |
| FEVER (fact checking) | 56.3% | 52.4% | 61.1% |
| SOK (science reasoning) | 63.5% | 51.8% | 67.2% |
Modern LLMs support structured tool definitions through schemas. When the model decides to use a tool, it outputs a JSON object conforming to the defined schema:
Tool Definition (OpenAI function calling):
{
  "name": "search_database",
  "description": "Search the product catalog for items matching criteria",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search query for products"
      },
      "category": {
        "type": "string",
        "enum": ["electronics", "clothing", "home", "sports"],
        "description": "Filter by category"
      },
      "max_price": {
        "type": "number",
        "description": "Maximum price filter"
      }
    },
    "required": ["query"]
  }
}
The model can then call the function with appropriate arguments. The function executes and returns results, which are fed back to the model for the next step.
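On the application side, this round trip reduces to looking up the named function, decoding the JSON-encoded arguments, and serializing the result back for the model. The sketch below stubs out both the model's tool call and the `search_database` implementation:

```python
import json

def search_database(query, category=None, max_price=None):
    """Stub implementation standing in for a real catalog search."""
    return [{"name": "USB-C cable", "price": 9.99}]

# Shape of a tool call emitted by a function-calling model: the function
# name plus its arguments as a JSON-encoded string.
tool_call = {
    "name": "search_database",
    "arguments": '{"query": "usb cable", "category": "electronics", "max_price": 20}',
}

registry = {"search_database": search_database}

func = registry[tool_call["name"]]
args = json.loads(tool_call["arguments"])  # validate against the schema here
result = func(**args)

# Serialize the result and hand it back to the model as the tool's output
tool_message = {"role": "tool", "content": json.dumps(result)}
```

Note that the model only ever produces the call; your code owns the registry lookup and execution, which is the natural place for argument validation and permission checks.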
LangChain provides a comprehensive agent framework with built-in tool integrations and agent types:
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

# Define tools (search_function and calculate are your own implementations)
tools = [
    Tool(
        name="search",
        func=search_function,
        description="Search the web for information"
    ),
    Tool(
        name="calculator",
        func=calculate,
        description="Perform mathematical calculations"
    )
]

# Create agent (prompt pulled from the LangChain hub)
llm = ChatOpenAI(model="gpt-4")
prompt = hub.pull("hwchase17/openai-functions-agent")
agent = create_openai_functions_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run agent
result = agent_executor.invoke({"input": "What is the population of Tokyo?"})
LlamaIndex offers agent-oriented abstractions with built-in support for context retrieval and reasoning:
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI

agent = ReActAgent.from_tools(
    tools=[search_tool, document_tool, code_tool],
    llm=OpenAI(model="gpt-4"),
    verbose=True
)
response = agent.chat("Analyze our Q3 revenue and compare to Q2")
AutoGPT became a viral demonstration of agent-like behavior: an LLM that breaks down goals into sub-tasks, executes them, and iterates. However, practical deployments revealed significant limitations: agents got stuck in repetitive loops, drifted away from their original goals, and accumulated API costs without converging on a result.
Despite these limitations, AutoGPT demonstrated valuable principles: autonomous goal decomposition, iterative self-evaluation, and persistent memory across steps.
These insights inform more robust agent frameworks that add the necessary guardrails.
Single agents have limitations: they can only think in one "voice," have fixed tool sets, and struggle with fundamentally different reasoning approaches. Multi-agent systems address this by having multiple agents with different roles collaborate.
      ┌─────────────┐
      │ Supervisor  │   (Routes tasks, evaluates results)
      └──────┬──────┘
             ↓
        ┌────┴────┐
        ↓         ↓
    ┌───┴───┐ ┌───┴───┐
    │ Exec1 │ │ Exec2 │   (Specialized executors)
    └───────┘ └───────┘
The supervisor routes sub-tasks to specialized executors based on task type. This pattern is common in customer service: a supervisor routes technical questions to a coding agent and policy questions to a knowledge agent.
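A supervisor of this kind can be sketched as a routing function. The keyword-based classifier below is a toy stand-in for the supervisor model's routing decision; all names are illustrative:

```python
def supervise(task, executors, classify):
    """Route a task to a specialized executor and return its result.
    `classify` stands in for the supervisor model's routing decision."""
    route = classify(task)
    if route not in executors:
        raise ValueError(f"no executor registered for task type: {route}")
    return executors[route](task)


# Toy routing for illustration: keyword matching in place of an LLM call
def classify(task):
    return "coding" if "error" in task.lower() else "knowledge"

executors = {
    "coding": lambda t: f"[coding agent] investigating: {t}",
    "knowledge": lambda t: f"[knowledge agent] answering: {t}",
}
```

In a real system the supervisor would also evaluate the executor's result and decide whether to re-route, retry, or finish.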
┌──────────────┐
│   Agent A    │  (Argues position)
└──────┬───────┘
       ↓
┌──────┴───────┐
│   Agent B    │  (Counter-argument)
└──────┬───────┘
       ↓
┌──────┴───────┐
│    Judge     │  (Evaluates and decides)
└──────────────┘
Multiple agents debate a question, with a judge evaluating arguments. Research by Liang et al. (2023) showed this improves factual accuracy on complex reasoning tasks—agents catch each other's errors.
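A minimal version of the debate protocol looks like this; the three callables are scripted stand-ins for separate model calls:

```python
def debate(question, agent_a, agent_b, judge, rounds=2):
    """Two agents exchange arguments on a shared transcript; a judge
    reads the full exchange and decides. Each callable stands in for
    a separate LLM call."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + agent_a(transcript))
        transcript.append("B: " + agent_b(transcript))
    return judge(transcript)


# Toy debaters for illustration: B catches A's arithmetic slip.
agent_a = lambda t: "The answer is 12."
agent_b = lambda t: "5 + 7 is 12, but the question asked 5 + 6, so 11."
judge = lambda t: "11" if any("11" in line for line in t) else "12"
```

The shared transcript is the essential piece: each agent argues against what the other actually said, which is how errors get surfaced.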
Tasks decompose hierarchically, with higher-level agents delegating to lower-level agents:
Level 3: Project Manager
            ↓
Level 2: Data Analyst, Engineer, Writer
            ↓
Level 1: (Specialized tools and execution)
This mirrors organizational structures and enables complex projects with appropriate specialization.
Agents in a system need to communicate. Key patterns include direct message passing between specific agents, a shared scratchpad (blackboard) that all agents read and write, and publish/subscribe channels where agents broadcast to topics.
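One common substrate for agent communication is a shared publish/subscribe bus; a minimal sketch, not tied to any framework:

```python
from collections import defaultdict

class MessageBus:
    """Minimal publish/subscribe bus for inter-agent messages."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver to every agent subscribed to the topic; collect replies
        return [handler(message) for handler in self.subscribers[topic]]


bus = MessageBus()
bus.subscribe("research", lambda msg: f"researcher took: {msg}")
bus.subscribe("research", lambda msg: f"archivist logged: {msg}")
```

Decoupling senders from receivers this way lets agents be added or removed without rewiring every peer.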
Agents need memory to maintain context across steps. Memory systems typically layer multiple storage types: short-term working memory for the current task, episodic memory of past actions and their results, and long-term semantic memory retrieved on demand.
Simple approaches use a message buffer. Sophisticated systems use vector stores for semantic retrieval, with summarization to compress long histories:
Memory retrieval for current context:
1. Embed current query
2. Retrieve top-K similar past experiences from vector store
3. Retrieve recent conversation history
4. Combine into context for current step
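The retrieval steps above can be sketched as follows, with word-overlap (Jaccard) similarity standing in for embedding similarity and a plain list standing in for the vector store:

```python
def retrieve_context(query, past_experiences, recent_history, k=2):
    """Assemble context per the steps above: top-K similar past
    experiences plus the most recent conversation turns."""
    def similarity(a, b):
        # Jaccard word overlap; a real system would compare embeddings
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    # Steps 1-2: score stored experiences against the query, keep top-K
    ranked = sorted(past_experiences, key=lambda e: similarity(query, e), reverse=True)
    relevant = ranked[:k]

    # Steps 3-4: combine with the most recent conversation turns
    return relevant + recent_history[-3:]
```

Capping both the top-K retrieval and the recent-history window keeps the assembled context within the model's budget even as memory grows.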
Robust agents handle tool failures gracefully:
Tool call failed → Check error type:
- Transient error (timeout, rate limit): Retry with backoff
- Permanent error (invalid input): Abort and report
- Partial failure: Try alternative tool or approach
- Max retries exceeded: Fail gracefully with partial results
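This decision tree translates directly into a retry wrapper; the two exception classes below are illustrative stand-ins for whatever errors your tools actually raise:

```python
import time

class TransientError(Exception):
    """Timeouts, rate limits: worth retrying."""

class PermanentError(Exception):
    """Invalid input: retrying cannot help."""

def call_tool(tool, *args, max_retries=3, base_delay=1.0):
    """Retry transient failures with exponential backoff, abort on
    permanent errors, and fail gracefully once retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return tool(*args)
        except PermanentError:
            raise  # abort and report: the input itself is bad
        except TransientError:
            if attempt == max_retries - 1:
                return None  # fail gracefully with no result
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

The agent can treat a `None` return as "try an alternative tool or report partial results," matching the last branch of the decision tree.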
AI agents represent the practical application of LLM capabilities to real-world workflows. The key to building effective agents isn't raw model power—it's thoughtful architecture: appropriate tool design, robust memory systems, explicit planning, and production-grade error handling.
The field is evolving rapidly. Multi-agent systems are moving from research demonstrations to practical applications. As models improve and frameworks mature, agents will become increasingly capable of handling complex, multi-step tasks with minimal human intervention.