Implementing AI Agents with Tool Use and Function Calling

Last Tuesday, I spent 14 hours debugging a production incident where our customer support agent entered an infinite loop, attempting to refund the same $45.00 transaction 212 times. The model wasn't 'hallucinating' in the traditional sense; it was successfully calling the refund_payment tool, but because the tool's success response was slightly too verbose, the model interpreted the confirmation as a prompt to 'verify and repeat' the action. This is the reality of building AI agents in 2026: the challenge isn't getting the model to speak; it's getting it to act reliably within the constraints of your existing infrastructure.

The Shift from RAG to Agency

In 2024, we were obsessed with Retrieval-Augmented Generation (RAG). We treated LLMs as fancy search engines that could summarize PDF files. Today, that's table stakes. The real value is now in 'Agentic Workflows'—systems where the LLM is the reasoning core that selects and executes tools to change the state of the world. We've moved from 'Read-Only' AI to 'Read-Write' AI. This shift requires a fundamental change in how we write code. You aren't just writing prompts anymore; you are designing APIs that a non-deterministic entity will consume. If your tool definitions are brittle, your agent will break. If your error handling is vague, your agent will hallucinate a 'success' state when it actually failed.

The Schema is the Contract

The most common mistake I see engineers make is treating tool descriptions as an afterthought. In a production environment using models like GPT-4o-2026-01 or Claude 3.7, the model uses the tool's JSON schema to understand its capabilities. If your schema is ambiguous, your agent is dangerous. I use Pydantic for everything. It provides a single source of truth for validation, serialization, and JSON schema generation.

Defining Robust Tools

Don't just pass a string to a tool. Use structured data. Here is a pattern I use for a CRM integration agent that handles lead updates. Notice the explicit descriptions—these are not for humans; they are for the model's attention mechanism.

from pydantic import BaseModel, Field
from typing import Optional, Literal
import enum

class LeadStatus(str, enum.Enum):
    NEW = "new"
    CONTACTED = "contacted"
    QUALIFIED = "qualified"
    LOST = "lost"

class UpdateLeadSchema(BaseModel):
    """
    Updates a lead's record in the CRM. Only call this when the user explicitly 
    requests a change to their status or contact details.
    """
    lead_id: str = Field(..., description="The unique ID of the lead, e.g., 'LD-9928'")
    status: Optional[LeadStatus] = Field(None, description="The new workflow stage")
    priority_score: Optional[int] = Field(None, ge=1, le=10, description="Internal score 1-10")
    summary_note: str = Field(..., min_length=10, description="A mandatory brief explanation of why the update occurred.")

def update_lead_record(data: UpdateLeadSchema):
    # Implementation logic here
    return {"status": "success", "updated_id": data.lead_id}

By using an Enum and a Field with constraints (like ge=1), you are narrowing the model's search space. This drastically reduces the likelihood of the model passing a value like 'Very High' instead of an integer.
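To see those constraints do their job, here is a minimal sketch that validates raw model arguments before the tool runs. It redeclares a trimmed copy of the schema so it can run on its own. `validate_tool_args` is a hypothetical helper, not part of Pydantic, and the error-message format is just my assumption about what reads well to a model.

```python
import enum
from typing import Optional, Tuple

from pydantic import BaseModel, Field, ValidationError


class LeadStatus(str, enum.Enum):
    NEW = "new"
    CONTACTED = "contacted"
    QUALIFIED = "qualified"
    LOST = "lost"


class UpdateLeadSchema(BaseModel):
    """Trimmed copy of the schema above so this sketch is self-contained."""
    lead_id: str = Field(..., description="The ID of the lead, e.g., 'LD-9928'")
    status: Optional[LeadStatus] = Field(None, description="The new workflow stage")
    priority_score: Optional[int] = Field(None, ge=1, le=10, description="Internal score 1-10")


def validate_tool_args(raw_args: dict) -> Tuple[Optional[UpdateLeadSchema], Optional[str]]:
    """Hypothetical helper: return (parsed_args, None) or (None, error_text)."""
    try:
        return UpdateLeadSchema(**raw_args), None
    except ValidationError as exc:
        # Name each offending field so the model knows what to fix.
        problems = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return None, "Invalid arguments. " + "; ".join(problems)


parsed, error = validate_tool_args({"lead_id": "LD-9928", "priority_score": "Very High"})
print(error)  # the message names priority_score as the failing field
```

The error text is what you would hand back to the model as the tool result, so it can correct the argument instead of guessing.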

The Execution Loop: State over Scripts

Linear scripts like query -> response -> tool -> response don't work for complex tasks. You need a state machine. I've found that graph-based architectures (like LangGraph or similar state-management patterns) are the only way to handle multi-step reasoning without the logic turning into spaghetti code. The agent needs to be able to 'loop back' to a planning stage if a tool returns an error.
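Here is a minimal sketch of that loop-back behavior in plain Python, without any framework. The agent is a small state machine whose ACT state sends control back to PLAN when a tool fails. The state names, the `plan` and `act` callables, and the `max_steps` budget are all assumptions for illustration. LangGraph and similar libraries express the same shape through their own APIs.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class State(Enum):
    PLAN = "plan"
    ACT = "act"
    DONE = "done"
    FAILED = "failed"


def run_state_machine(
    plan: Callable[[List[str]], str],
    act: Callable[[str], Tuple[bool, str]],
    max_steps: int = 6,
) -> Dict[str, object]:
    """Drive PLAN -> ACT transitions; a failed action loops back to PLAN.

    plan(observations) returns the next action to try.
    act(action) returns (ok, observation).
    Both are hypothetical stand-ins for an LLM call and a tool call.
    """
    state = State.PLAN
    observations: List[str] = []
    action = ""

    for _ in range(max_steps):
        if state is State.PLAN:
            action = plan(observations)
            state = State.ACT
        elif state is State.ACT:
            ok, observation = act(action)
            observations.append(observation)
            # Success ends the run; an error returns to planning with the
            # observation attached so the next plan can differ.
            state = State.DONE if ok else State.PLAN
        if state is State.DONE:
            return {"state": state, "observations": observations}

    return {"state": State.FAILED, "observations": observations}


# Scripted stand-ins: the first plan uses a bad ID, the second corrects it.
def fake_plan(observations: List[str]) -> str:
    return "lookup LD-9928" if observations else "lookup 9928"


def fake_act(action: str) -> Tuple[bool, str]:
    if action == "lookup LD-9928":
        return True, "found lead LD-9928"
    return False, "error: invalid id"


result = run_state_machine(fake_plan, fake_act)
print(result["state"], result["observations"])
```

The step budget matters as much as the transitions: without it, a planner that never changes its mind would bounce between PLAN and ACT forever.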

Managing the Reasoning Trace

Here is a simplified version of a stateful loop that handles tool errors gracefully. The key is the try/except around tool execution, which plays the role a tool node plays in a graph framework: it intercepts failures and feeds them back to the model as observations, rather than crashing the application.

import json
from typing import Callable, Dict

def agent_loop(initial_prompt: str, tools: Dict[str, Callable]):
    # `llm` is assumed to be a chat-model client defined elsewhere whose
    # responses expose `.content` and `.tool_calls`.
    messages = [{"role": "user", "content": initial_prompt}]

    for _ in range(5):  # Limit iterations to prevent infinite loops
        response = llm.invoke(messages, tools=list(tools.values()))
        messages.append(response)

        if not response.tool_calls:
            return response.content

        for tool_call in response.tool_calls:
            tool_name = tool_call["name"]
            args = tool_call["args"]

            try:
                # Execute the actual function
                result = tools[tool_name](**args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": json.dumps(result),
                })
            except Exception as e:
                # Feed the error back to the model so it can try to fix its input
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": f"Error: {e}. Please correct the arguments and try again.",
                })

    return "Max iterations reached without resolution."

The Observability Gap

When an agent fails in production, you can't just look at a stack trace. You need the full 'Thought-Action-Observation' trace. In 2026, I recommend logging not just the tokens, but the specific schema version used during the call. I've seen cases where a model performed flawlessly on a Monday and started failing on a Tuesday because a downstream API added a required field to a JSON response that the model wasn't expecting, causing it to hallucinate a value to fill the gap.
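One way to capture that, sketched with the standard library only: fingerprint each tool's JSON schema and stamp every trace record with it, so a Monday-to-Tuesday regression can be tied to a schema change. The record layout, the `schema_fingerprint` and `log_step` helpers, and the in-memory `TRACE` list are all assumptions for illustration. In production the records would go to whatever logging or tracing backend you already run.

```python
import hashlib
import json
import time
from typing import Any, Dict, List

TRACE: List[Dict[str, Any]] = []  # stand-in for a real log sink


def schema_fingerprint(schema: Dict[str, Any]) -> str:
    """Short, stable hash of a JSON schema; key order does not affect it."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_step(run_id: str, kind: str, payload: Any, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Record one thought, action, or observation with the schema version in use."""
    if kind not in {"thought", "action", "observation"}:
        raise ValueError(f"unknown trace kind: {kind}")
    record = {
        "run_id": run_id,
        "step": len(TRACE),
        "kind": kind,
        "schema_version": schema_fingerprint(schema),
        "payload": payload,
        "ts": time.time(),
    }
    TRACE.append(record)
    return record


schema_v1 = {"type": "object", "properties": {"lead_id": {"type": "string"}}, "required": ["lead_id"]}
schema_v2 = {**schema_v1, "required": ["lead_id", "region"]}  # a field became required

log_step("run-1", "action", {"tool": "update_lead_record", "args": {"lead_id": "LD-9928"}}, schema_v1)
log_step("run-1", "observation", {"status": "success"}, schema_v1)
print(json.dumps(TRACE[0], indent=2))
print(schema_fingerprint(schema_v1) != schema_fingerprint(schema_v2))
```

Filtering failed runs by `schema_version` then tells you quickly whether the failures cluster around a schema that changed.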

Production Gotchas (What the docs don't tell you)

The Context Window is a Trash Can: Every time a tool is called, the output is appended to the history. If your tool returns a massive 50KB JSON blob of raw database rows, you will exhaust your context window and degrade the model's reasoning capabilities within three turns. Prune your tool outputs. Return only the fields the model actually needs to see.
Tool-Use Loops: Models can get stuck. If a tool returns {'error': 'invalid id'} and the model thinks the ID is correct, it might try the exact same call five times. You must implement a 'Cycle Detector' in your agent logic to break these loops and escalate to a human.
Parallel Tool Calling: Modern models try to be efficient by calling multiple tools at once (e.g., get_weather and get_time). If the tools have side effects or depend on each other's results (e.g., create_user then assign_role), running them in parallel can fail or apply changes in the wrong order. If your tools must run sequentially, explicitly disable parallel tool calling in your model configuration.
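The first two gotchas can each be guarded against with a few lines. This is a minimal sketch, not a library API: `prune_tool_output` and `CycleDetector` are hypothetical names, the allow-list of fields is invented for the example, and the threshold of three identical calls is an arbitrary assumption you would tune.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable


def prune_tool_output(row: Dict[str, Any], keep: Iterable[str]) -> Dict[str, Any]:
    """Return only the allow-listed fields so raw rows never reach the context."""
    wanted = set(keep)
    return {key: value for key, value in row.items() if key in wanted}


class CycleDetector:
    """Counts identical (tool, args) calls and flags repeats past a threshold."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_escalate(self, tool_name: str, args: Dict[str, Any]) -> bool:
        # Canonical JSON makes {"a": 1, "b": 2} and {"b": 2, "a": 1} the same call.
        # Assumes args are JSON-serializable, which tool-call arguments are.
        key = tool_name + ":" + json.dumps(args, sort_keys=True)
        self._seen[key] += 1
        return self._seen[key] >= self.max_repeats


raw_row = {"lead_id": "LD-9928", "status": "new", "audit_blob": "x" * 50_000}
print(prune_tool_output(raw_row, keep=["lead_id", "status"]))

detector = CycleDetector(max_repeats=3)
for attempt in range(1, 4):
    stuck = detector.should_escalate("refund_payment", {"transaction_id": "T-1", "amount": 45.0})
    print(attempt, stuck)
```

In the agent loop, you would call `should_escalate` before executing each tool call and hand off to a human as soon as it returns True, and pass every tool result through `prune_tool_output` before appending it to the message history.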

Takeaway

Stop building 'God Tools' that try to do everything. Build small, atomic, strictly-typed functions. Tomorrow morning, take one of your existing AI prompts and replace a vague natural language instruction with a structured Pydantic tool. You'll see an immediate jump in reliability. The future of software engineering isn't just writing code for CPUs; it's writing interfaces for reasoners.

Uğur Kaval
