Interacting With LLMs
Prelude
As it currently stands, all LLMs are doing next-token prediction.
When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution.
Source: Anthropic Engineering
Some labs are also incorporating message routers that try to determine the best model for a given input.
GPT‑5 is a unified system with a smart, efficient model that answers most questions, a deeper reasoning model (GPT‑5 thinking) for harder problems, and a real‑time router that quickly decides which to use based on conversation type, complexity, tool needs, and your explicit intent (for example, if you say "think hard about this" in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model.
Source: OpenAI
Either way, user input is still wrapped into an API request, rendered through a chat template, and tokenized so the LLM can generate a response via next-token prediction.
This post is a write-up of my current understanding and thoughts about interacting with inference providers. It's really fun to interact with large language models and try to steer them, so let's talk about it.
Inference Flow
/v1/chat/completions
OpenAI's chat completions API specification has become the de facto standard, with most providers adding minor configuration options on top.
{
"model": "gpt-3.5-turbo-0613",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Get the current time"}
],
"tools": [...], # Array of tool definitions
"tool_choice": "auto", # Optional: how the model should use tools
"temperature": 1.0 # Optional: sampling temperature
}

API requests to /v1/chat/completions work by sending a JSON payload containing the model name, the conversation messages (system, user, and assistant), and optional parameters like tools, temperature, etc. The API frontend passes this structured input to the chat template processor, which organizes the messages into a format the model can understand, and the model then generates a response that's returned to the client.
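A minimal sketch of sending the request above from Rust, assuming the reqwest (with its json feature), serde_json, and tokio crates and an OPENAI_API_KEY environment variable; any OpenAI-compatible endpoint works the same way.

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the same JSON payload shown above.
    let body = json!({
        "model": "gpt-3.5-turbo-0613",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Get the current time"}
        ],
        "temperature": 1.0
    });

    // POST it to the chat completions endpoint.
    let response: serde_json::Value = reqwest::Client::new()
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(std::env::var("OPENAI_API_KEY")?)
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // The assistant's reply comes back under choices[0].message.content.
    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}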
Local Inference
Local inference can vary. When using a tool like llama-server via llama.cpp, the model parameter is neither required nor used, while Ollama requires it. Generally it's best practice to define the model parameter even if it's not required.
llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf
...
main: server is listening on http://127.0.0.1:8080 - starting the main loop

llama-server simply serves a single model, while Ollama uses the model parameter to serve any downloaded model on the filesystem.
Chat Templates
Open weight models are invaluable for understanding how models interpret API requests. Let's examine Qwen3-235B-A22B's template to get a better understanding of how our API request gets formatted. Special tokens will play a key role throughout the template.
Handling tools
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
If tools are provided, the system prompt starts with:
- An <|im_start|>system block.
- Optional inclusion of the first message if it is a system role.
- A tools declaration section describing:
  - What tools are available (<tools> ... </tools>).
  - How the assistant should return tool calls (<tool_call> ... </tool_call>).
- Each tool is JSON-encoded and injected inside the <tools> XML block.
If there are no tools, it instead just dumps the system message:
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
In this case, the inclusion of tools is tied to the input message start special token <|im_start|> together with the system message role.
Multi-step tool detection
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
- Loops over the messages in reverse.
- Finds the last user query before tool responses (<tool_response>) begin.
- Sets ns.last_query_index to the index of that user query.
This section is important because it determines where to insert <think> blocks.
Message iteration: The Big Loop
{%- for message in messages %}
This loop is the meat and potatoes of the chat template. It goes through all messages and applies different formatting depending on the role assigned to each message.
User & secondary system messages
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}

- Wraps with <|im_start|>user ... <|im_end|> special tokens.
- Allows extra system messages (after the first).
Assistant messages
This is the most complex part.
Extract reasoning (<think>)

{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0]... %}
{%- set content = content.split('</think>')[-1]... %}
{%- endif %}
{%- endif %}

The assistant can have hidden "reasoning" (<think>...</think>). This strips it out into reasoning_content, leaving the visible answer in content.

Decide whether to show reasoning
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}

If this assistant message happens after the last user query, it may include reasoning.
Otherwise, just show the content.
Tool calls
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
...
{{- '<tool_call>\n{"name": "' }}{{- tool_call.name }}
...

Any tool calls are serialized in <tool_call>...</tool_call> XML tags. Ensures multiple tool calls are correctly separated.
Tool responses
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}

Tool responses are actually shown to the model as if the user wrote them, wrapped in <tool_response>. Multiple tool responses get batched into one <|im_start|>user ... <|im_end|> block.
Final assistant generation prompt
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}
At the very end, if we’re asking the model to generate the next turn:
- Starts the <|im_start|>assistant block.
- Optionally inserts an empty <think> block (if explicit reasoning is disabled).
Summary of Flow
- System & tools → prepare context.
- Detect last user query (for reasoning cutoff).
- Loop through messages:
  - User/system → wrap normally.
  - Assistant → handle reasoning + tool calls.
  - Tool → shown as <tool_response> inside a user block.
- Add assistant generation prompt if needed.
In practice:
This template is what takes a JSON-like chat history and turns it into tokenized markup the LLM can understand.
- <|im_start|>role … <|im_end|> are special markers for message boundaries.
- <think> blocks separate reasoning from visible answers.
- <tool_call> / <tool_response> implement structured tool use.
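To make this concrete, a simple request (one system message, one user message, no tools) renders to roughly the following before tokenization; exact whitespace may vary:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Get the current time<|im_end|>
<|im_start|>assistant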
API Tool Response
Parsing tool calls out of the API response is the client's responsibility.
[
{
"id": "fc_12345xyz",
"call_id": "call_12345xyz",
"type": "function_call",
"name": "get_weather",
"arguments": "{\"location\":\"Paris, France\"}"
},
{
"id": "fc_67890abc",
"call_id": "call_67890abc",
"type": "function_call",
"name": "get_weather",
"arguments": "{\"location\":\"Bogotá, Colombia\"}"
}
]

Capturing, serializing, and executing tool calls is entirely on the client to orchestrate. What returns to the model as a <tool_response> is also entirely on the client to determine. The response can be as long or as short as deemed necessary.
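What that dispatch might look like is sketched below in Rust, assuming serde_json and a stubbed get_weather implementation; the field names mirror the example response above.

use serde_json::Value;

// Execute one tool call and produce the string that will be wrapped
// in <tool_response> (or sent back as a "tool" role message).
fn execute_tool_call(name: &str, arguments: &str) -> String {
    let args: Value = match serde_json::from_str(arguments) {
        Ok(v) => v,
        Err(e) => return format!("Error: invalid arguments: {e}"),
    };

    match name {
        "get_weather" => {
            let location = args["location"].as_str().unwrap_or("unknown");
            // A stub stands in for the real weather lookup.
            format!("Current temperature in {location}: 24°C")
        }
        _ => format!("Error: unknown tool '{name}'"),
    }
}

fn main() {
    let result = execute_tool_call("get_weather", "{\"location\":\"Paris, France\"}");
    println!("{result}");
}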
Tool Definitions: Keeping It Concise
Tool definitions are essentially function signatures with documentation. The challenge is balancing clarity with token efficiency.
Anatomy of a Tool Definition
{
"type": "function",
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
}
}

The function JSON definition is straightforward and should be familiar to anyone with some programming experience. The developer defines a name and a set of parameters, and can provide as much or as little documentation to decorate that.
Best Practices
- Keep descriptions concise. Every word costs tokens.
- Be specific about formats: YYYY-MM-DD HH:MM:SS, not "date format".
- Use clear parameter names, self-documenting if possible.
- Optimize for your model, test what works best, iterate on tool definitions.
Tool Philosophy: Balancing Power and Safety
The Benchmark Trap
LLM system cards showcase impressive capabilities with open-ended tools:
The model is provided access to a code editor and a Terminal Tool, which enables asynchronous management of multiple terminal sessions…
Source: Claude 4 System Card, page 117
This is great for benchmarking a model's efficacy but terrible for a production system. Budget constraints, time, security, and user experience all play a vital role when delivering a software system. Therefore every tool introduced should be highly scrutinized and curated.
Real-World Constraints
Every tool represents tradeoffs:
- Capability vs Safety: Bash executor vs constrained DSL
- Flexibility vs Cost: Generic tools vs specialized ones
- Power vs Control: Terminal access vs specific commands
Tool Selection Strategy
- Start with the minimum viable toolset.
- Add tools based on actual needs.
- Consider safety implications.
- Monitor token usage.
- Measure success rates.
Over-Inclusion and Context Management
The Context Pollution Problem
Context pollution is the measurable distance between original intent and current direction, created by the natural entropy of complex interactions.
Source: kurtiskemple.com
Large language models are non-deterministic systems. It's easy to deviate from the original goal because no two outputs are ever the same.
How Tools Pollute Context
Looking back at our Qwen3 chat template, the tools[] array prints right into the <tools></tools> XML block of the formatted prompt, taking up valuable context and tokens.
The Cost Calculation
- 10 tools × 100 tokens each = 1000 tokens
- Every request includes all tools
- Irrelevant tools distract from the goal
Development Strategies
- Dynamic Tool Loading: Only include relevant tools (see the sketch after this list)
- Tool Grouping: Create tool sets for different tasks
- KISS (Keep it simple, stupid): Start simple, add complexity as needed
- Token Budgeting: Set limits for tool definitions
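The first two strategies might look something like the sketch below: filter the full tool list by task and stop once a token budget is spent. The task categories and the characters-per-token heuristic are assumptions, and the tool JSON is assumed to use the {"type": "function", "function": {...}} shape shown later in this post.

use serde_json::Value;

/// Return only the tool definitions relevant to the current task,
/// stopping once a rough token budget is exhausted.
fn select_tools(all_tools: &[Value], task: &str, token_budget: usize) -> Vec<Value> {
    let relevant: &[&str] = match task {
        "filesystem" => &["read_file", "write_file", "list_directory"],
        "weather" => &["get_weather"],
        _ => &[],
    };

    let mut selected = Vec::new();
    let mut spent = 0;
    for tool in all_tools {
        let name = tool["function"]["name"].as_str().unwrap_or("");
        // Rough heuristic: ~4 characters per token.
        let cost = tool.to_string().len() / 4;
        if relevant.contains(&name) && spent + cost <= token_budget {
            spent += cost;
            selected.push(tool.clone());
        }
    }
    selected
}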
Tool Development: The Implementation Details
Bad Tool Definition: Overly Complex
{
"type": "function",
"function": {
"name": "file_system_operation_handler",
"description": "This is a comprehensive file system management tool that can perform various operations including but not limited to reading files, writing files, deleting files, creating directories, checking file existence, getting file metadata, and more. It supports both text and binary files with various encoding options.",
"parameters": {
"type": "object",
"properties": {
"operation": {
"type": "string",
"enum": ["read", "write", "delete", "mkdir", "exists", "stat"],
"description": "The type of file system operation to perform"
},
"path": {
"type": "string",
"description": "The file or directory path to operate on"
},
"content": {
"type": "string",
"description": "Content for write operations (optional)"
},
"encoding": {
"type": "string",
"description": "File encoding (utf-8, ascii, etc.)"
},
"create_parents": {
"type": "boolean",
"description": "Whether to create parent directories"
},
"offset": {
"type": "integer",
"description": "Byte offset for partial reads"
},
"limit": {
"type": "integer",
"description": "Maximum bytes to read"
}
},
"required": ["operation", "path"]
}
}
}

Problems:
- Does too many things (violates single responsibility)
- Vague, wordy description wastes tokens
- Complex parameter validation logic
- Model must reason about operation type first
- Error-prone with many optional parameters
Good Tool Definition: Single Purpose
{
"type": "function",
"function": {
"name": "read_file",
"description": "Read text file contents. Returns full content or partial with offset/limit.",
"parameters": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Absolute file path"
},
"offset": {
"type": "integer",
"description": "Number of lines to skip from the beginning of the file (0-indexed)",
"minimum": 0
},
"limit": {
"type": "integer",
"description": "Maximum number of lines to return after the offset",
"minimum": 1
}
},
"required": ["path"]
}
}
}

Benefits:
- Single, clear purpose
- Concise description saves tokens
- Simple parameters
- Line-based (not byte-based) for text files
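For illustration, the handler behind a definition like this can be a few lines; this sketch assumes the executor has already parsed the arguments, since the exact signature depends on your tool plumbing.

use std::fs;

/// Read a text file, optionally skipping `offset` lines and returning at
/// most `limit` lines, mirroring the read_file definition above.
fn read_file(path: &str, offset: usize, limit: Option<usize>) -> Result<String, String> {
    let content = fs::read_to_string(path)
        .map_err(|e| format!("Error reading {path}: {e}"))?;

    let lines: Vec<&str> = content.lines().skip(offset).collect();
    let end = limit.unwrap_or(lines.len()).min(lines.len());
    Ok(lines[..end].join("\n"))
}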
What is an Agent?
An agent is fundamentally simple and should remain as such. Keeping the logic flow of an agent to the bare minimum allows us to add complexity elsewhere to steer the task.
- API request to LLM in a loop
- Parse tool calls from response
- Execute tools
- Feed results back
- Repeat until done
No magic, just a while loop matching on tool calls and returning the tool results continuously until no tool calls remain.
The Core Agent Loop
pub async fn execute_chat_with_tools(
    &self,
    mut messages: Vec<Message>,
) -> Result<ChatResult> {
    let mut turn_count = 0;
    loop {
        // 1. Check exit conditions
        if turn_count >= self.config.max_turns {
            return Err(CoreError::MaxTurnsExceeded);
        }

        // 2. Send to LLM
        let request = ChatRequest::new(messages.clone())
            .with_tools(self.get_tools())
            .with_streaming(false);
        let response = self.chat_with_retry(request).await?;
        let assistant_message = response.message.clone();

        // 3. Check for tool calls (THIS IS THE EXIT)
        if assistant_message.tool_calls.is_empty() {
            return Ok(ChatResult {
                content: assistant_message.content,
                messages,
            });
        }

        // Record the assistant turn (with its tool calls) before appending
        // tool results, so the next request contains the full exchange.
        messages.push(assistant_message.clone());

        // 4. Execute each tool
        // (conversation_id is assumed to be in scope, e.g. a field on self)
        for tool_call in &assistant_message.tool_calls {
            match self.tool_executor.execute(tool_call).await {
                Ok(result) => {
                    messages.push(Message::tool(
                        conversation_id,
                        result,
                        tool_call.id.clone(),
                    ));
                }
                Err(e) => {
                    messages.push(Message::tool(
                        conversation_id,
                        format!("Error: {}", e),
                        tool_call.id.clone(),
                    ));
                }
            }
        }

        // 5. Continue loop
        turn_count += 1;
    }
}

Key Insights
- The loop exits when no tools are called; this is the termination condition
- Each turn adds to conversation history
- Errors don't break the loop; they are the tool response
- Turn limits prevent infinite loops
Common Pitfalls
Even with the best intentions it can be difficult to avoid some common pitfalls during tool development. Failure and iteration are necessary for progress, but it's important to build safeguards against them. Here are some issues I've encountered during development and some mitigations I've tried to alleviate them.
Token Explosion
Conversation grows unbounded beyond token limit
- Summarize old messages (a cruder trimming variant is sketched below)
- Reset on task completion
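Below is a minimal sketch of the cruder variant: drop the oldest non-system messages once a rough token estimate exceeds a budget. A fuller implementation would summarize the dropped turns into a single message instead. The Message struct and the characters-per-token heuristic are assumptions.

/// A hypothetical message type; real projects will have their own.
struct Message {
    role: String,
    content: String,
}

// Rough heuristic: ~4 characters per token.
fn estimated_tokens(messages: &[Message]) -> usize {
    messages.iter().map(|m| m.content.len() / 4).sum()
}

/// Trim history until it fits the budget, always keeping the system prompt
/// (index 0) and the most recent messages.
fn trim_history(messages: &mut Vec<Message>, token_budget: usize) {
    while estimated_tokens(messages) > token_budget && messages.len() > 2 {
        messages.remove(1); // drop the oldest non-system message
    }
}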
Tool Call Loops
Model keeps calling same tool
- Track tool call history
- Add "already tried" to
<tool_result> - Implement circuit breakers
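The circuit breaker mentioned above can be as simple as counting repeated identical calls and returning an "already tried" error past a threshold. The ToolCall struct and the threshold of three are assumptions for illustration.

use std::collections::HashMap;

/// A hypothetical tool call as parsed from the model's response.
struct ToolCall {
    name: String,
    arguments: String,
}

#[derive(Default)]
struct CircuitBreaker {
    counts: HashMap<String, u32>,
}

impl CircuitBreaker {
    /// Returns Some(error message) once the same call has repeated too often;
    /// feed that message back as the tool result instead of executing again.
    fn check(&mut self, call: &ToolCall) -> Option<String> {
        let key = format!("{}:{}", call.name, call.arguments);
        let count = self.counts.entry(key).or_insert(0);
        *count += 1;
        if *count > 3 {
            Some(format!(
                "Error: {} already tried {} times with these arguments; try a different approach.",
                call.name, *count
            ))
        } else {
            None
        }
    }
}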
Ambiguous Tool Selection
Model chooses wrong tool
- Develop tool descriptions against smaller models
- Avoid overlapping tools
- Set tool categories/namespaces
Summary
Writing software that interacts with large language models has a lot of nuance to it. To complicate matters further, the field is moving rapidly with new techniques and protocols trying to stake their claim.
Personally, I like to keep my tool definitions and prompts as concise as possible, offloading as much logic as I can and constraining tool definitions to limit the creativity and diversity of the model's output. While this might seem counterintuitive at first, I find that successful and meaningful results come in fewer turns.
Thanks for reading!