Interacting With LLMs
Prelude
As it currently stands, all LLMs are doing next-token prediction.
When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution.
Source: Anthropic Engineering
Some labs are also incorporating message routers that try to determine the best model for a given input.
GPT‑5 is a unified system with a smart, efficient model that answers most questions, a deeper reasoning model (GPT‑5 thinking) for harder problems, and a real‑time router that quickly decides which to use based on conversation type, complexity, tool needs, and your explicit intent (for example, if you say "think hard about this" in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model.
Source: OpenAI
Either way, user input is still wrapped into an API request, rendered through a chat template, and tokenized so the LLM can generate a response via next-token prediction.
This post is a write-up of my current understanding and thoughts about interacting with inference providers. It's really fun to interact with large language models and try to steer them, so let's talk about it.
Inference Flow
/v1/chat/completions
OpenAI's chat completions API specification has become the de facto standard, with most providers adding minor configuration options on top.
{
"model": "gpt-3.5-turbo-0613",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Get the current time"}
],
"tools": [...], # Array of tool definitions
"tool_choice": "auto", # Optional: how the model should use tools
"temperature": 1.0 # Optional: sampling temperature
}

API requests to /v1/chat/completions work by sending a JSON payload containing the model name, the conversation messages (system, user, and assistant), and optional parameters like tools, temperature, etc. The API frontend passes this structured input to the chat template processor, which organizes the messages into a format the model can understand, and the model then generates a response that's returned to the client.
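A minimal sketch of sending the request above from Rust, assuming the reqwest (with its json feature), serde_json, and tokio crates and an OPENAI_API_KEY environment variable; any OpenAI-compatible endpoint works the same way.

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the same JSON payload shown above.
    let body = json!({
        "model": "gpt-3.5-turbo-0613",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Get the current time"}
        ],
        "temperature": 1.0
    });

    // POST it to the chat completions endpoint.
    let response: serde_json::Value = reqwest::Client::new()
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(std::env::var("OPENAI_API_KEY")?)
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // The assistant's reply comes back under choices[0].message.content.
    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}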
Local Inference
Local inference can vary. When using a tool like llama-server via llama.cpp, the model parameter is neither required nor used, while Ollama requires it. Generally it's best practice to define the model parameter even if it's not required.
llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf
...
main: server is listening on http://127.0.0.1:8080 - starting the main loop

llama-server simply serves a single model, while Ollama uses the model parameter to serve any downloaded model on the filesystem.
Chat Templates
Open weight models are invaluable for understanding how models interpret API requests. Let's examine Qwen3-235B-A22B's template to get a better understanding of how our API request gets formatted. Special tokens will play a key role throughout the template.
Handling tools
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
If tools are provided, the system prompt starts with:
- An <|im_start|>system block.
- Optional inclusion of the first message if it is a system role.
- A tools declaration section describing:
  - What tools are available (<tools> ... </tools>).
  - How the assistant should return tool calls (<tool_call> ... </tool_call>).
- Each tool is JSON-encoded and injected inside the <tools> XML block.
If there are no tools, it instead just dumps the system message:
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
In this case, the inclusion of tools is tied to the input message start special token <|im_start|> together with the system message role.
Multi-step tool detection
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
- Loops over the messages in reverse.
- Finds the last user query before tool responses (<tool_response>) begin.
- Sets ns.last_query_index to the index of that user query.
This section is important because it determines where to insert <think> blocks.
Message iteration: The Big Loop
{%- for message in messages %}
This loop is the meat and potatoes of the chat template. It goes through all messages and applies different formatting depending on the role assigned to each message.
User & secondary system messages
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}

- Wraps with <|im_start|>user ... <|im_end|> special tokens.
- Allows extra system messages (after the first).
Assistant messages
This is the most complex part.
Extract reasoning (<think>)

{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0]... %}
{%- set content = content.split('</think>')[-1]... %}
{%- endif %}
{%- endif %}

The assistant can have hidden "reasoning" (<think>...</think>). This strips it out into reasoning_content, leaving the visible answer in content.

Decide whether to show reasoning
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}

If this assistant message happens after the last user query, it may include reasoning.
Otherwise, just show the content.
Tool calls
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
...
{{- '<tool_call>\n{"name": "' }}{{- tool_call.name }}
...

Any tool calls are serialized in <tool_call>...</tool_call> XML tags. Ensures multiple tool calls are correctly separated.
Tool responses
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}

Tool responses are actually shown to the model as if the user wrote them, wrapped in <tool_response>. Multiple tool responses get batched into one <|im_start|>user ... <|im_end|> block.
Final assistant generation prompt
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}
At the very end, if we’re asking the model to generate the next turn:
- Starts the <|im_start|>assistant block.
- Optionally inserts an empty <think> block (if explicit reasoning is disabled).
Summary of Flow
- System & tools → prepare context.
- Detect last user query (for reasoning cutoff).
- Loop through messages:
  - User/system → wrap normally.
  - Assistant → handle reasoning + tool calls.
  - Tool → shown as <tool_response> inside a user block.
- Add assistant generation prompt if needed.
In practice:
This template is what takes a JSON-like chat history and turns it into tokenized markup the LLM can understand.
- <|im_start|>role … <|im_end|> are special markers for message boundaries.
- <think> blocks separate reasoning from visible answers.
- <tool_call> / <tool_response> implement structured tool use.
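To make this concrete, a simple request (one system message, one user message, no tools) renders to roughly the following before tokenization; exact whitespace may vary:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Get the current time<|im_end|>
<|im_start|>assistant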
API Tool Response
Parsing tool calls out of the API response is the client's responsibility.
[
{
"id": "fc_12345xyz",
"call_id": "call_12345xyz",
"type": "function_call",
"name": "get_weather",
"arguments": "{\"location\":\"Paris, France\"}"
},
{
"id": "fc_67890abc",
"call_id": "call_67890abc",
"type": "function_call",
"name": "get_weather",
"arguments": "{\"location\":\"Bogotá, Colombia\"}"
}
]

Capturing, serializing, and executing tool calls is entirely on the client to orchestrate. What returns to the model as a <tool_response> is also entirely on the client to determine. The response can be as long or as short as deemed necessary.
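What that dispatch might look like is sketched below in Rust, assuming serde_json and a stubbed get_weather implementation; the field names mirror the example response above.

use serde_json::Value;

// Execute one tool call and produce the string that will be wrapped
// in <tool_response> (or sent back as a "tool" role message).
fn execute_tool_call(name: &str, arguments: &str) -> String {
    let args: Value = match serde_json::from_str(arguments) {
        Ok(v) => v,
        Err(e) => return format!("Error: invalid arguments: {e}"),
    };

    match name {
        "get_weather" => {
            let location = args["location"].as_str().unwrap_or("unknown");
            // A stub stands in for the real weather lookup.
            format!("Current temperature in {location}: 24°C")
        }
        _ => format!("Error: unknown tool '{name}'"),
    }
}

fn main() {
    let result = execute_tool_call("get_weather", "{\"location\":\"Paris, France\"}");
    println!("{result}");
}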
Tool Definitions: Keeping It Concise
Tool definitions are essentially function signatures with documentation. The challenge is balancing clarity with token efficiency.
Anatomy of a Tool Definition
{
"type": "function",
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
}
}

The function JSON definition is straightforward and should be familiar to anyone with some programming experience. The developer defines a name and a set of parameters, and can provide as much or as little documentation to decorate that.
Best Practices
- Keep descriptions concise. Every word costs tokens.
- Be specific about formats: YYYY-MM-DD HH:MM:SS, not "date format".
- Use clear parameter names, self-documenting if possible.
- Optimize for your model, test what works best, iterate on tool definitions.
Tool Philosophy: Balancing Power and Safety
The Benchmark Trap
LLM system cards showcase impressive capabilities with open-ended tools:
The model is provided access to a code editor and a Terminal Tool, which enables asynchronous management of multiple terminal sessions…
Source: Claude 4 System Card, page 117
This is great for benchmarking a model's efficacy but terrible for a production system. Budget constraints, time, security, and user experience all play a vital role when delivering a software system. Therefore every tool introduced should be highly scrutinized and curated.
Real-World Constraints
Every tool represents tradeoffs:
- Capability vs Safety: Bash executor vs constrained DSL
- Flexibility vs Cost: Generic tools vs specialized ones
- Power vs Control: Terminal access vs specific commands
Tool Selection Strategy
- Start with the minimum viable toolset.
- Add tools based on actual needs.
- Consider safety implications.
- Monitor token usage.
- Measure success rates.
Over-Inclusion and Context Management
The Context Pollution Problem
Context pollution is the measurable distance between original intent and current direction, created by the natural entropy of complex interactions.
Source: kurtiskemple.com
Large language models are non-deterministic systems. It's easy to deviate from the original goal because no two outputs are ever the same.
How Tools Pollute Context
Looking back at our Qwen3 chat template, the tools[] array prints right into the <tools></tools> XML block of the formatted prompt, taking up valuable context and tokens.
The Cost Calculation
- 10 tools × 100 tokens each = 1000 tokens
- Every request includes all tools
- Irrelevant tools distract from the goal
Development Strategies
- Dynamic Tool Loading: Only include relevant tools (see the sketch after this list)
- Tool Grouping: Create tool sets for different tasks
- KISS (Keep it simple, stupid): Start simple, add complexity as needed
- Token Budgeting: Set limits for tool definitions
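The first two strategies might look something like the sketch below: filter the full tool list by task and stop once a token budget is spent. The task categories and the characters-per-token heuristic are assumptions, and the tool JSON is assumed to use the {"type": "function", "function": {...}} shape shown later in this post.

use serde_json::Value;

/// Return only the tool definitions relevant to the current task,
/// stopping once a rough token budget is exhausted.
fn select_tools(all_tools: &[Value], task: &str, token_budget: usize) -> Vec<Value> {
    let relevant: &[&str] = match task {
        "filesystem" => &["read_file", "write_file", "list_directory"],
        "weather" => &["get_weather"],
        _ => &[],
    };

    let mut selected = Vec::new();
    let mut spent = 0;
    for tool in all_tools {
        let name = tool["function"]["name"].as_str().unwrap_or("");
        // Rough heuristic: ~4 characters per token.
        let cost = tool.to_string().len() / 4;
        if relevant.contains(&name) && spent + cost <= token_budget {
            spent += cost;
            selected.push(tool.clone());
        }
    }
    selected
}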
Tool Development: The Implementation Details
Bad Tool Definition: Overly Complex
{
"type": "function",
"function": {
"name": "file_system_operation_handler",
"description": "This is a comprehensive file system management tool that can perform various operations including but not limited to reading files, writing files, deleting files, creating directories, checking file existence, getting file metadata, and more. It supports both text and binary files with various encoding options.",
"parameters": {
"type": "object",
"properties": {
"operation": {
"type": "string",
"enum": ["read", "write", "delete", "mkdir", "exists", "stat"],
"description": "The type of file system operation to perform"
},
"path": {
"type": "string",
"description": "The file or directory path to operate on"
},
"content": {
"type": "string",
"description": "Content for write operations (optional)"
},
"encoding": {
"type": "string",
"description": "File encoding (utf-8, ascii, etc.)"
},
"create_parents": {
"type": "boolean",
"description": "Whether to create parent directories"
},
"offset": {
"type": "integer",
"description": "Byte offset for partial reads"
},
"limit": {
"type": "integer",
"description": "Maximum bytes to read"
}
},
"required": ["operation", "path"]
}
}
}

Problems:
- Does too many things (violates single responsibility)
- Vague, wordy description wastes tokens
- Complex parameter validation logic
- Model must reason about operation type first
- Error-prone with many optional parameters
Good Tool Definition: Single Purpose
{
"type": "function",
"function": {
"name": "read_file",
"description": "Read text file contents. Returns full content or partial with offset/limit.",
"parameters": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Absolute file path"
},
"offset": {
"type": "integer",
"description": "Number of lines to skip from the beginning of the file (0-indexed)",
"minimum": 0
},
"limit": {
"type": "integer",
"description": "Maximum number of lines to return after the offset",
"minimum": 1
}
},
"required": ["path"]
}
}
}

Benefits:
- Single, clear purpose
- Concise description saves tokens
- Simple parameters
- Line-based (not byte-based) for text files
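For illustration, the handler behind a definition like this can be a few lines; this sketch assumes the executor has already parsed the arguments, since the exact signature depends on your tool plumbing.

use std::fs;

/// Read a text file, optionally skipping `offset` lines and returning at
/// most `limit` lines, mirroring the read_file definition above.
fn read_file(path: &str, offset: usize, limit: Option<usize>) -> Result<String, String> {
    let content = fs::read_to_string(path)
        .map_err(|e| format!("Error reading {path}: {e}"))?;

    let lines: Vec<&str> = content.lines().skip(offset).collect();
    let end = limit.unwrap_or(lines.len()).min(lines.len());
    Ok(lines[..end].join("\n"))
}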
What is an Agent?
An agent is fundamentally simple and should remain as such. Keeping the logic flow of an agent to the bare minimum allows us to add complexity elsewhere to steer the task.
- API request to LLM in a loop
- Parse tool calls from response
- Execute tools
- Feed results back
- Repeat until done
No magic, just a while loop matching on tool calls and returning the tool results continuously until no tool calls remain.
The Core Agent Loop
pub async fn execute_chat_with_tools(
    &self,
    mut messages: Vec<Message>,
) -> Result<ChatResult> {
    let mut turn_count = 0;
    loop {
        // 1. Check exit conditions
        if turn_count >= self.config.max_turns {
            return Err(CoreError::MaxTurnsExceeded);
        }

        // 2. Send to LLM
        let request = ChatRequest::new(messages.clone())
            .with_tools(self.get_tools())
            .with_streaming(false);
        let response = self.chat_with_retry(request).await?;
        let assistant_message = response.message.clone();

        // 3. Check for tool calls (THIS IS THE EXIT)
        if assistant_message.tool_calls.is_empty() {
            return Ok(ChatResult {
                content: assistant_message.content,
                messages,
            });
        }

        // Record the assistant turn (with its tool calls) before appending
        // tool results, so the next request contains the full exchange.
        messages.push(assistant_message.clone());

        // 4. Execute each tool
        // (conversation_id is assumed to be in scope, e.g. a field on self)
        for tool_call in &assistant_message.tool_calls {
            match self.tool_executor.execute(tool_call).await {
                Ok(result) => {
                    messages.push(Message::tool(
                        conversation_id,
                        result,
                        tool_call.id.clone(),
                    ));
                }
                Err(e) => {
                    messages.push(Message::tool(
                        conversation_id,
                        format!("Error: {}", e),
                        tool_call.id.clone(),
                    ));
                }
            }
        }

        // 5. Continue loop
        turn_count += 1;
    }
}

Key Insights
- The loop exits when no tools are called; this is the termination condition
- Each turn adds to conversation history
- Errors don't break the loop; they are the tool response
- Turn limits prevent infinite loops
Common Pitfalls
Even with the best intentions it can be difficult to avoid some common pitfalls during tool development. Failure and iteration are necessary for progress, but it's important to build safeguards against them. Here are some issues I've encountered during development and some mitigations I've tried to alleviate them.
Token Explosion
Conversation grows unbounded beyond token limit
- Summarize old messages (a cruder trimming variant is sketched below)
- Reset on task completion
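Below is a minimal sketch of the cruder variant: drop the oldest non-system messages once a rough token estimate exceeds a budget. A fuller implementation would summarize the dropped turns into a single message instead. The Message struct and the characters-per-token heuristic are assumptions.

/// A hypothetical message type; real projects will have their own.
struct Message {
    role: String,
    content: String,
}

// Rough heuristic: ~4 characters per token.
fn estimated_tokens(messages: &[Message]) -> usize {
    messages.iter().map(|m| m.content.len() / 4).sum()
}

/// Trim history until it fits the budget, always keeping the system prompt
/// (index 0) and the most recent messages.
fn trim_history(messages: &mut Vec<Message>, token_budget: usize) {
    while estimated_tokens(messages) > token_budget && messages.len() > 2 {
        messages.remove(1); // drop the oldest non-system message
    }
}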
Tool Call Loops
Model keeps calling same tool
- Track tool call history
- Add "already tried" to
<tool_result> - Implement circuit breakers
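The circuit breaker mentioned above can be as simple as counting repeated identical calls and returning an "already tried" error past a threshold. The ToolCall struct and the threshold of three are assumptions for illustration.

use std::collections::HashMap;

/// A hypothetical tool call as parsed from the model's response.
struct ToolCall {
    name: String,
    arguments: String,
}

#[derive(Default)]
struct CircuitBreaker {
    counts: HashMap<String, u32>,
}

impl CircuitBreaker {
    /// Returns Some(error message) once the same call has repeated too often;
    /// feed that message back as the tool result instead of executing again.
    fn check(&mut self, call: &ToolCall) -> Option<String> {
        let key = format!("{}:{}", call.name, call.arguments);
        let count = self.counts.entry(key).or_insert(0);
        *count += 1;
        if *count > 3 {
            Some(format!(
                "Error: {} already tried {} times with these arguments; try a different approach.",
                call.name, *count
            ))
        } else {
            None
        }
    }
}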
Ambiguous Tool Selection
Model chooses wrong tool
- Develop tool descriptions against smaller models
- Avoid overlapping tools
- Set tool categories/namespaces
Summary
Writing software that interacts with large language models has a lot of nuance to it. To complicate matters further, the field is moving rapidly with new techniques and protocols trying to stake their claim.
Personally, I like to keep my tool definitions and prompts as concise as possible, offloading as much logic as I can and constraining tool definitions to limit the creativity and diversity of the model's output. While this might seem counterintuitive at first, I find that successful and meaningful results come in fewer turns.
Thanks for reading!