AI Costs Too High? Here's How to Fix It

05.06.2026

Better results, fewer tokens - the best strategies

The token invoice arrives, and the number is bigger than expected. A brief moment of panic. But before switching into full savings mode: rising AI costs mean your team is actually using AI. The transformation is happening. That's a good thing.

Of course, AI should also be used economically. And here lies one of the most satisfying insights from real-world AI practice: using AI more efficiently almost always produces better results too. Cost optimisation and quality optimisation go hand in hand. Here's how.

The chat trap: The most expensive misunderstanding

One of the most common cost drivers is also the least visible. Many team members get into the habit of pulling up old chats from their history and simply continuing where they left off. It feels practical - but it's a cost trap.

How chats actually work: Every time you send a message in a chat, it's not just your new question that gets sent to the AI provider. The entire previous chat - every single message, every reply, every uploaded document - is transmitted as input. With a long chat containing extensive analyses or PDF files, a single short follow-up question can easily cost ten euros or more.

This creates three problems at once:

High costs from massive input token volumes
Longer response times, because the model has more to process
Worse answers - studies show a significant drop in performance once the context window is around 40% full
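
A rough sketch of why long chats get expensive, assuming the whole history is resent with every message as described above. The function name and all token counts are made up for illustration; real numbers depend on the model's tokenizer and your documents.

```python
def billed_input_tokens(turns, attachment_tokens=0):
    """Return the input tokens billed for each turn of one continuous chat.

    turns: list of (user_tokens, reply_tokens) pairs, one per exchange.
    attachment_tokens: tokens of documents uploaded at the start (assumed value).
    """
    history = attachment_tokens
    billed = []
    for user_tokens, reply_tokens in turns:
        # Each request carries the full history plus the new message.
        billed.append(history + user_tokens)
        # The message and the reply both become history for the next turn.
        history += user_tokens + reply_tokens
    return billed


# Five short follow-up questions (50 tokens each, 800-token replies)
# in a chat that started with a 40,000-token PDF.
turns = [(50, 800)] * 5
billed = billed_input_tokens(turns, attachment_tokens=40_000)
total = sum(billed)
print(billed, total)
```

In this made-up example the five questions contain only 250 tokens of new text, but more than 200,000 input tokens are billed, because the PDF and all earlier replies travel along every time.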

The one-million-token misunderstanding

"But the model has a context window of one million tokens - surely you can have as long a chat as you like!" This is understandable, but misleading. Modern models do have enormous context windows of up to one million tokens (roughly 750,000 words or 3,000 pages of text). But the quality of responses measurably deteriorates as the context grows. The model starts to "forget" or blur earlier information.

The fix is simple: Start a fresh chat for each new task. If you genuinely need context from a previous conversation, ask the assistant to summarise the key points and paste that summary into the new chat. The results are better, the chat runs faster, and costs drop significantly.

Choosing the right model: Performance tiers instead of a model jungle

The number of available AI models today is almost overwhelming. OpenAI, Anthropic, Google, Mistral - every provider has its own naming conventions, version numbers, and marketing labels. GPT-5.2, Claude Sonnet 4.6, Gemini 3.5 Flash - what's comparable to what?

In the Radio Creator AI Tools, we've sorted all models from all providers into unified performance tiers. From top to bottom:

Tier	Best suited for	Cost
frontier	Flagship models, maximum capability	$$$$$
premium	Complex analysis, agents, code	$$$$
high	Professional editorial work	$$$
medium	General texts, summaries	$$
small / mini / nano	Simple, frequently repeated tasks	$

The key insight: For most editorial tasks - writing copy, editing scripts, preparing presenter links, creating social media posts - high-tier models are more than sufficient. A frontier model delivers no noticeably better results for these tasks, but costs many times more.

When smaller models are the smarter choice

For simple, frequently repeated tasks, the smaller tiers are the best option - and not just for cost reasons. They're faster, and perfectly adequate for clearly defined jobs:

Assigning keywords: A piece of content needs to be tagged automatically → nano model is plenty
Sentiment analysis: Is this social media post positive, negative, or neutral? → mini model
Categorisation: Which section or topic area does this story belong to? → small model
Short summaries of news copy → medium model

Tiered model routing - deliberately assigning tasks to the cheapest adequate model - can reduce average costs per request by 60 to 80 percent compared to routing everything through a single premium model.
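
Tiered routing can be pictured as a simple lookup table. The sketch below is illustrative only: the task labels, the relative prices, and the workload mix are assumptions, not Radio Creator settings or real provider prices, and the resulting saving depends entirely on them.

```python
# Assumed relative cost per request for each tier (not real prices).
RELATIVE_COST = {
    "nano": 1, "mini": 2, "small": 4, "medium": 10,
    "high": 25, "premium": 50, "frontier": 100,
}

# Task-to-tier mapping taken from the examples in the list above.
TASK_TIER = {
    "keywords": "nano",
    "sentiment": "mini",
    "categorisation": "small",
    "short_summary": "medium",
}

DEFAULT_TIER = "high"  # editorial work that is not listed explicitly


def route(task_type):
    """Pick the cheapest tier considered adequate for a task type."""
    return TASK_TIER.get(task_type, DEFAULT_TIER)


def average_cost(task_types, single_tier=None):
    """Average relative cost per request, routed or forced onto one tier."""
    tiers = [single_tier or route(t) for t in task_types]
    return sum(RELATIVE_COST[t] for t in tiers) / len(tiers)


# Hypothetical mix of 100 requests.
workload = (
    ["keywords"] * 20 + ["sentiment"] * 10 + ["categorisation"] * 10
    + ["short_summary"] * 20 + ["script_edit"] * 40
)
routed = average_cost(workload)
premium_only = average_cost(workload, single_tier="premium")
saving = 1 - routed / premium_only
print(f"routed: {routed:.1f}, premium only: {premium_only:.1f}, saving: {saving:.0%}")
```

Unknown task types fall back to the high tier, so nothing is silently sent to a model that may be too weak for it.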

Thinking budget: The hidden cost lever

Modern AI models can "think" - they work through internal reasoning steps before generating a response. This is called reasoning, and it makes answers significantly better for complex tasks. But reasoning tokens are billed too, typically as output tokens. As a rule of thumb, multiply your expected costs by a factor of 3 to 5 when using reasoning models to get realistic estimates.
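
The rule of thumb above is easy to turn into a planning range. This is only a back-of-the-envelope sketch; the function name and the 200-euro starting figure are made up.

```python
def reasoning_cost_range(base_cost, low_factor=3.0, high_factor=5.0):
    """Return a (low, high) cost estimate when reasoning is switched on.

    base_cost: expected cost without reasoning, in any currency.
    The default factors are the 3x to 5x rule of thumb from the text.
    """
    return base_cost * low_factor, base_cost * high_factor


# Hypothetical monthly estimate without reasoning, in euros.
low, high = reasoning_cost_range(200.0)
print(f"Plan for roughly {low:.0f} to {high:.0f} euros")
```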

The problem: every provider has its own system for controlling the thinking budget. And even within a single model family, the settings differ.

In the Radio Creator AI Tools, we've standardised this. For all models and all providers, the same levels are available: minimal (thinking off), low, medium, high, xhigh, and auto. If you're unsure, choose auto - the model then decides how much reasoning the task actually needs.

When to turn the thinking down

Not every task needs deep reasoning. Here it makes sense to reduce the thinking budget or switch it off entirely:

Writing to a template (e.g. presenter links in the station's house style)
Simple translations
Formatting tasks (fitting text into a specific layout or structure)
Short social media posts derived from an existing piece of content
Standardised summaries following a fixed schema

For these tasks, low or even minimal is enough. That saves time and money without any loss in quality.

Token caching: The silent money-saver working in the background

Here's one of the most powerful cost-saving measures - and the good news is: in the Radio Creator AI Tools, it's already activated by default.

How token caching works: Imagine your team calls up the same assistant many times a day. That assistant has a long system prompt - a detailed description of its role, its tone, its rules. Without caching, this system prompt is sent to the AI provider in full and reprocessed from scratch every single time. With caching, the provider stores the processed prompt after the first call. Subsequent calls that start with the same prompt draw from the cached version - and pay only a fraction of the normal price for that part. Provider caches typically expire after a period of inactivity, so the savings are largest for assistants that are in frequent use.

Cached input tokens cost around 10 percent of the normal price at OpenAI and Anthropic - a saving of 90 percent on the cached portion of the input. For applications with consistent system prompts, this can reduce input costs by 70 to 90 percent.
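
A worked example of the arithmetic, under simplifying assumptions: the first call pays full price, every later call hits the cache, and any surcharge for writing to the cache is ignored. The token counts and the price of 3.00 per million input tokens are invented for illustration.

```python
def input_cost(system_tokens, message_tokens, calls,
               price_per_million, cached_factor=0.10):
    """Return (uncached, cached) input cost for repeated assistant calls.

    cached_factor: price of a cached token relative to a normal one
    (0.10 = the "around 10 percent" mentioned in the text).
    """
    per_token = price_per_million / 1_000_000
    full_call = (system_tokens + message_tokens) * per_token
    # Only the system prompt is cached; the new message is always full price.
    cached_call = (system_tokens * cached_factor + message_tokens) * per_token
    uncached = calls * full_call
    cached = full_call + (calls - 1) * cached_call
    return uncached, cached


uncached, cached = input_cost(
    system_tokens=4_000, message_tokens=500, calls=100, price_per_million=3.0
)
saving = 1 - cached / uncached
print(f"without caching: {uncached:.2f}, with caching: {cached:.2f}, saving: {saving:.0%}")
```

The larger the system prompt is compared with the individual message, the closer the overall saving gets to the 90 percent discount on the cached part.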

In the Radio Creator AI Tools, token caching is automatically active for all models that support it. There's nothing to configure - the savings happen in the background.

Optimising assistants: Invest once, save continuously

This is perhaps the biggest long-term lever for teams using AI regularly. The strategy: create a dedicated assistant for recurring tasks - with a carefully crafted system prompt.

Why this saves money: a well-configured assistant needs fewer explanations in every new chat. You don't have to tell it every time what tone to use, which terminology your station uses, or which formats to follow. All of that lives in the system prompt - and thanks to token caching, it's processed at a fraction of the normal cost.

How it works in practice

Create dedicated assistants for recurring tasks: a presenter links assistant, a social media assistant, a news assistant, a research assistant.
Have the system prompt written for you - in the Radio Creator AI Tools, the assistant "Prompty" does exactly this. Describe your task, and Prompty writes a professional system prompt.
Always start with a fresh, empty chat - not from the history. This avoids the chat trap.
Continuously improve the system prompt: In the first few weeks, you'll notice the assistant doesn't yet know everything - it might be missing information about your station's style, specific programme formats, or editorial conventions. Add these details to the system prompt step by step. The assistant gets better with every iteration, and you need to explain less and less each time.

A smaller model with a precise system prompt often outperforms a larger model with a vague one. A well-configured high-tier assistant beats a poorly configured frontier assistant - and costs significantly less.

More tips: Staying in control

Set budget limits and spending alerts

Most AI providers let you set a monthly cost limit. Once a hard limit is reached, no further requests are processed. You can also set up budget alert emails - for example, a notification when 50% or 80% of the budget has been used. No more unpleasant surprises at the end of the month.

OpenAI: platform.openai.com → Usage → Manage spend alerts
Anthropic: console.anthropic.com → Credits → Limits
Google AI: aistudio.google.com → Billing
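
If you track spending yourself, for example from an exported usage report, the same alert logic takes only a few lines. This is a minimal sketch with made-up figures; it does not call any provider API.

```python
def budget_alerts(spent, monthly_limit, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds (as fractions) that spending has reached."""
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    used = spent / monthly_limit
    return [t for t in thresholds if used >= t]


# Hypothetical figures: 420 spent out of a 500 monthly budget.
reached = budget_alerts(spent=420.0, monthly_limit=500.0)
print(reached)
```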

Short prompts, clear instructions

Long, rambling prompts cost more tokens - and often produce worse results. Clear, precise instructions are more efficient. If you tell the model exactly what to do, you don't need to explain what not to do.

Request structured output

When you need data or lists, ask for structured formats such as a table, a bullet list, or JSON. A clearly organised response is shorter than an extended prose answer - and much easier to work with downstream.
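
As an illustration of "easier to work with downstream": if the prompt asks for JSON, the reply can be checked and used directly. The prompt wording, the key names, and the sample reply below are all hypothetical.

```python
import json

# Hypothetical instruction that asks for a compact, structured reply.
PROMPT = (
    "Tag the following story. Reply with JSON only, using the keys "
    '"keywords" (list of strings) and "sentiment" '
    '("positive", "negative" or "neutral").'
)

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_tags(raw):
    """Parse and validate a JSON reply that follows the prompt above."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("keywords"), list):
        raise ValueError("'keywords' must be a list")
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError("unexpected 'sentiment' value")
    return data


# Made-up example of what a model reply could look like.
reply = '{"keywords": ["city council", "budget"], "sentiment": "neutral"}'
tags = parse_tags(reply)
print(tags["keywords"], tags["sentiment"])
```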

Efficiency is not a compromise

Using AI more efficiently almost always means getting better results too. Short, focused chats. The right model for the right task. A well-maintained system prompt. A conscious approach to the thinking budget. These aren't restrictions - this is professional AI practice.

The Radio Creator AI Tools are built so that many of these optimisations happen automatically: token caching is active by default, models are sorted into clear performance tiers, and the thinking budget can be controlled uniformly across all providers.

Want to see how this works in practice? Explore the AI Tools and discover how much more efficient your editorial workflow can become: radio-creator.com/en/ai-tools.html
