Beyond Chatbots: Engineering Production-Grade AI Agents with Tool Use
Stop treating LLMs as oracles and start treating them as orchestrators. Learn how to build reliable, schema-validated agents that interact with real-world APIs using modern 2026 patterns.

Last Tuesday, I spent 14 hours debugging a production incident where our customer support agent entered an infinite loop, attempting to refund the same $45.00 transaction 212 times. The model wasn't 'hallucinating' in the traditional sense; it was successfully calling the refund_payment tool, but because the tool's success response was slightly too verbose, the model interpreted the confirmation as a prompt to 'verify and repeat' the action. This is the reality of building AI agents in 2026: the challenge isn't getting the model to speak; it's getting it to act reliably within the constraints of your existing infrastructure.
The Shift from RAG to Agency
In 2024, we were obsessed with Retrieval-Augmented Generation (RAG). We treated LLMs as fancy search engines that could summarize PDF files. Today, that's table stakes. The real value is now in 'Agentic Workflows'—systems where the LLM is the reasoning core that selects and executes tools to change the state of the world. We've moved from 'Read-Only' AI to 'Read-Write' AI. This shift requires a fundamental change in how we write code. You aren't just writing prompts anymore; you are designing APIs that a non-deterministic entity will consume. If your tool definitions are brittle, your agent will break. If your error handling is vague, your agent will hallucinate a 'success' state when it actually failed.
The Schema is the Contract
The most common mistake I see engineers make is treating tool descriptions as an afterthought. In a production environment using models like GPT-4o-2026-01 or Claude 3.7, the model uses the tool's JSON schema to understand its capabilities. If your schema is ambiguous, your agent is dangerous. I use Pydantic for everything. It provides a single source of truth for validation, serialization, and JSON schema generation.
Defining Robust Tools
Don't just pass a string to a tool. Use structured data. Here is a pattern I use for a CRM integration agent that handles lead updates. Notice the explicit descriptions—these are not for humans; they are for the model's attention mechanism.

    import enum
    from typing import Optional

    from pydantic import BaseModel, Field


    class LeadStatus(str, enum.Enum):
        NEW = "new"
        CONTACTED = "contacted"
        QUALIFIED = "qualified"
        LOST = "lost"


    class UpdateLeadSchema(BaseModel):
        """
        Updates a lead's record in the CRM. Only call this when the user explicitly
        requests a change to their status or contact details.
        """
        # The example ID is not a UUID, so the description says "unique ID".
        lead_id: str = Field(..., description="The unique ID of the lead, e.g., 'LD-9928'")
        status: Optional[LeadStatus] = Field(None, description="The new workflow stage")
        priority_score: Optional[int] = Field(None, ge=1, le=10, description="Internal score 1-10")
        summary_note: str = Field(..., min_length=10, description="A mandatory brief explanation of why the update occurred.")


    def update_lead_record(data: UpdateLeadSchema):
        # Implementation logic here
        return {"status": "success", "updated_id": data.lead_id}

By using an Enum and a Field with constraints (like ge=1), you are narrowing the model's search space. This drastically reduces the likelihood of the model passing a value like 'Very High' instead of an integer.
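To make the contract concrete, here is a minimal sketch of what the schema buys you at runtime. It assumes Pydantic v2 (model_json_schema and model_validate); validate_tool_args is a hypothetical helper name, not part of any library, and the model is repeated so the block runs on its own.

```python
import enum
from typing import Optional, Tuple

from pydantic import BaseModel, Field, ValidationError


class LeadStatus(str, enum.Enum):
    NEW = "new"
    CONTACTED = "contacted"
    QUALIFIED = "qualified"
    LOST = "lost"


class UpdateLeadSchema(BaseModel):
    """Updates a lead's record in the CRM."""
    lead_id: str = Field(..., description="The unique ID of the lead, e.g., 'LD-9928'")
    status: Optional[LeadStatus] = Field(None, description="The new workflow stage")
    priority_score: Optional[int] = Field(None, ge=1, le=10, description="Internal score 1-10")
    summary_note: str = Field(..., min_length=10, description="Why the update occurred.")


def validate_tool_args(raw_args: dict) -> Tuple[Optional[UpdateLeadSchema], Optional[str]]:
    """Hypothetical helper: return (model, None) on success or (None, error_text)."""
    try:
        return UpdateLeadSchema.model_validate(raw_args), None
    except ValidationError as exc:
        # Build a compact message that can be returned to the model as an observation.
        problems = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return None, "; ".join(problems)


# The JSON schema handed to the model; field names and required fields come from the class.
schema = UpdateLeadSchema.model_json_schema()

good, good_err = validate_tool_args(
    {"lead_id": "LD-9928", "priority_score": 7, "summary_note": "Customer asked for a callback."}
)
bad, bad_err = validate_tool_args(
    {"lead_id": "LD-9928", "priority_score": "Very High", "summary_note": "Customer asked for a callback."}
)
```

The second call never reaches the CRM: validation fails on priority_score, and the error text names the offending field so the model has something specific to correct.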
The Execution Loop: State over Scripts
Linear scripts like query -> response -> tool -> response don't work for complex tasks. You need a state machine. I've found that graph-based architectures (like LangGraph or similar state-management patterns) are the only way to handle multi-step reasoning without the logic turning into spaghetti code. The agent needs to be able to 'loop back' to a planning stage if a tool returns an error.
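The "loop back to planning" idea can be sketched without any framework. This is a deliberately tiny state machine in plain Python, not LangGraph's API; Stage, run_state_machine, and the stub planner and tool are all hypothetical names used for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    PLAN = "plan"
    ACT = "act"
    DONE = "done"
    FAILED = "failed"


def run_state_machine(
    plan: Callable[[List[str]], str],
    act: Callable[[str], str],
    max_steps: int = 6,
) -> Dict[str, object]:
    """Drive PLAN -> ACT transitions; a failed action loops back to PLAN."""
    stage = Stage.PLAN
    observations: List[str] = []
    action = ""
    steps = 0
    while stage not in (Stage.DONE, Stage.FAILED):
        if steps >= max_steps:
            # Hard budget so a confused planner cannot spin forever.
            stage = Stage.FAILED
            break
        steps += 1
        if stage is Stage.PLAN:
            action = plan(observations)
            stage = Stage.ACT
        elif stage is Stage.ACT:
            try:
                observations.append(act(action))
                stage = Stage.DONE
            except Exception as exc:
                # Record the failure as an observation and re-plan.
                observations.append(f"error: {exc}")
                stage = Stage.PLAN
    return {"stage": stage, "observations": observations, "steps": steps}


def stub_plan(observations: List[str]) -> str:
    # Stand-in for the LLM: switch strategy once an error has been observed.
    return "lookup_by_id" if observations else "lookup_by_email"


def stub_act(action: str) -> str:
    # Stand-in for a tool: the first strategy always fails.
    if action == "lookup_by_email":
        raise ValueError("email not found")
    return "lead LD-9928 found"


result = run_state_machine(stub_plan, stub_act)
```

With these stubs the run takes four steps: plan, a failed action, a second plan that sees the error, and a successful action. A linear script would have stopped at the first failure.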
Managing the Reasoning Trace
Here is a simplified version of a stateful loop that handles tool errors gracefully. The key is the tool-execution step (the role a ToolNode plays in graph frameworks), which intercepts execution and feeds errors back to the model as observations, rather than crashing the application.

    import json
    from typing import Any, Callable, Dict


    def agent_loop(initial_prompt: str, tools: Dict[str, Callable[..., Any]], llm: Any) -> str:
        # `llm` is assumed to be a client whose invoke(messages, tools=...) returns a
        # message with .content and .tool_calls (dicts with "id", "name", "args").
        # Each tool must accept keyword arguments; wrap Pydantic-based tools so the
        # raw dict is validated into the schema before the function is called.
        messages = [{"role": "user", "content": initial_prompt}]
        for _ in range(5):  # Limit iterations to prevent infinite loops
            response = llm.invoke(messages, tools=list(tools.values()))
            messages.append(response)
            if not response.tool_calls:
                return response.content
            for tool_call in response.tool_calls:
                tool_name = tool_call["name"]
                args = tool_call["args"]
                try:
                    # Execute the actual function
                    result = tools[tool_name](**args)
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call["id"],
                        "content": json.dumps(result),
                    })
                except Exception as e:
                    # Feed the error back to the model so it can try to fix its input
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call["id"],
                        "content": f"Error: {e}. Please correct the arguments and try again.",
                    })
        return "Max iterations reached without resolution."

The Observability Gap
When an agent fails in production, you can't just look at a stack trace. You need the full 'Thought-Action-Observation' trace. In 2026, I recommend logging not just the tokens, but the specific schema version used during the call. I've seen cases where a model performed flawlessly on a Monday and started failing on a Tuesday because a downstream API added a required field to a JSON response that the model wasn't expecting, causing it to hallucinate a value to fill the gap.
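One way to capture that trace is sketched below, using only the standard library. The names TraceStep, AgentTrace, and schema_version are hypothetical, and fingerprinting the schema with a hash is my assumption about how to "version" it; a hand-maintained version string would work just as well.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


def schema_version(schema: Dict[str, Any]) -> str:
    """Derive a short, stable fingerprint from a tool's JSON schema."""
    # sort_keys makes the fingerprint independent of dict ordering.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass
class TraceStep:
    thought: str
    tool_name: str
    tool_args: Dict[str, Any]
    observation: str
    schema_version: str
    timestamp: float = field(default_factory=time.time)


class AgentTrace:
    """Collects Thought-Action-Observation steps for one agent run."""

    def __init__(self) -> None:
        self.steps: List[TraceStep] = []

    def record(
        self,
        thought: str,
        tool_name: str,
        tool_args: Dict[str, Any],
        observation: str,
        schema: Dict[str, Any],
    ) -> None:
        self.steps.append(
            TraceStep(thought, tool_name, tool_args, observation, schema_version(schema))
        )

    def to_jsonl(self) -> str:
        # One JSON object per line, ready to ship to a log pipeline.
        return "\n".join(json.dumps(asdict(step), sort_keys=True) for step in self.steps)


# Two illustrative schemas that differ only by one newly required field.
schema_v1 = {"type": "object", "properties": {"lead_id": {"type": "string"}}, "required": ["lead_id"]}
schema_v2 = {
    "type": "object",
    "properties": {"lead_id": {"type": "string"}, "region": {"type": "string"}},
    "required": ["lead_id", "region"],
}

trace = AgentTrace()
trace.record(
    thought="User asked to mark the lead as qualified.",
    tool_name="update_lead_record",
    tool_args={"lead_id": "LD-9928", "status": "qualified"},
    observation='{"status": "success", "updated_id": "LD-9928"}',
    schema=schema_v1,
)
```

Because the fingerprint changes whenever the schema does, a Monday-versus-Tuesday regression shows up as a different schema_version in the logs before anyone has to diff API payloads.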
Production Gotchas (What the docs don't tell you)
- The Context Window is a Trash Can: Every time a tool is called, the output is appended to the history. If your tool returns a massive 50KB JSON blob of raw database rows, you will exhaust your context window and degrade the model's reasoning capabilities within three turns. Prune your tool outputs. Return only the fields the model actually needs to see.
- Tool-Use Loops: Models can get stuck. If a tool returns {'error': 'invalid id'} and the model thinks the ID is correct, it might try the exact same call five times. You must implement a 'Cycle Detector' in your agent logic to break these loops and escalate to a human.
- Parallel Tool Calling: Modern models try to be efficient by calling multiple tools at once (e.g., get_weather and get_time). If these tools have side effects or dependencies (e.g., create_user then assign_role), parallel execution will fail. Explicitly disable parallel calling in your model configuration if your tools are sequential.
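The first two gotchas lend themselves to small guards. The sketch below is one possible shape, assuming tool arguments are JSON-serializable; prune_output and CycleDetector are hypothetical names, and the transaction ID is made up for the demo.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable


def prune_output(payload: Dict[str, Any], keep: Iterable[str]) -> Dict[str, Any]:
    """Return only the whitelisted fields of a tool result."""
    return {key: payload[key] for key in keep if key in payload}


class CycleDetector:
    """Flags a tool call once the identical (name, args) pair repeats too often."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_escalate(self, tool_name: str, args: Dict[str, Any]) -> bool:
        # Canonical JSON so that argument order does not hide a repeated call.
        key = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats


# Pruning: keep two fields out of a bulky row before it enters the history.
row = {"id": "LD-9928", "status": "new", "raw_html": "<div>...</div>", "audit_log": [1, 2, 3]}
pruned = prune_output(row, keep=("id", "status"))

# Cycle detection: the third identical refund attempt trips the detector.
detector = CycleDetector(max_repeats=2)
refund_args = {"txn": "T-1", "amount": 45.0}  # hypothetical transaction ID
flags = [detector.should_escalate("refund_payment", refund_args) for _ in range(3)]
other = detector.should_escalate("refund_payment", {"txn": "T-2", "amount": 45.0})
```

In an agent loop, the check would sit just before tool execution: when should_escalate returns True, skip the call and hand the conversation to a human instead of appending yet another identical observation.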
Takeaway
Stop building 'God Tools' that try to do everything. Build small, atomic, strictly-typed functions. Tomorrow morning, take one of your existing AI prompts and replace a vague natural language instruction with a structured Pydantic tool. You'll see an immediate jump in reliability. The future of software engineering isn't just writing code for CPUs; it's writing interfaces for reasoners.