After the first post in this series — where we saw that an AI product is not a prompt, but an architecture of several layers (Interface, Routing, Reasoning, Tools, Memory, Observability) — it's time to take it to the next level.
And as promised, in this second part we get into the determining factors a decision maker should put on the table before kicking off an AI project, among others:
- Investment decisions: CAPEX vs OPEX applied to AI — the most prominent one.
- Supported languages: quality drops fast on dialects and non-majority languages outside the top-7.
- Target latency: not every agent needs real time; some run perfectly well in batch.
- Model catalogue: which ones you pick, how many you have, and how you manage them professionally with an LLM Gateway.
- Idle Zero Cost for deployed agents: how to stop your architecture from bleeding while the agents aren't serving anyone.
The goal: that you reach the committee equipped to decide your product's investment strategy with AI concepts correctly applied, read your AI product's P&L with confidence, and pick the operational pieces that keep cost from blowing up when you start scaling.
Let's get into it.
Investment strategy for the project applied to AI
Before we talk about models, latency or architecture, there's a classic business decision that still rules: do you want to invest more up front and less in operations (weighting towards CAPEX), or the opposite (weighting towards OPEX)? As in any business, it depends. But in AI there are new factors in this decision.
What always applies and never changes: it depends on how much you'll use the product, how validated it is, what initial risk you're willing to take and at what scale you expect to operate. The answer can be pure on-premise, pure cloud or a hybrid — but the first question you have to answer yourself is whether your immediate priority is preserving cash and speed or predicting and bringing down unit cost in the long run.
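To make that trade-off concrete, here is a minimal break-even sketch. Every number in it (hardware price, amortisation period, operating cost, per-request prices) is a made-up assumption for illustration, not a benchmark. Plug in your own quotes.

```python
import math


def fixed_monthly_cost(hardware_price: float, amortisation_months: int,
                       ops_per_month: float) -> float:
    """CAPEX side: amortised hardware plus fixed operations (staff, power, hosting)."""
    return hardware_price / amortisation_months + ops_per_month


def break_even_requests(fixed_per_month: float, api_cost_per_request: float,
                        own_cost_per_request: float) -> float:
    """Monthly volume above which owning the infrastructure beats paying per use.

    Returns math.inf when the API is cheaper per request, because then
    owning never pays off at any volume.
    """
    saving = api_cost_per_request - own_cost_per_request
    if saving <= 0:
        return math.inf
    return fixed_per_month / saving


# Hypothetical figures: 108,000 of hardware over 36 months, 6,000/month of operations,
# 0.02 per request via API versus 0.002 marginal cost per request on your own cluster.
fixed = fixed_monthly_cost(108_000, 36, 6_000)
threshold = break_even_requests(fixed, 0.02, 0.002)
print(f"Fixed cost: {fixed:,.0f}/month, break-even at about {threshold:,.0f} requests/month")
```

Below that threshold, pay-per-use wins; above it, the fixed investment starts to pay back. The sketch ignores model quality differences, which is exactly why the decision can't be made from the spreadsheet alone.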
If you haven't validated market fit yet, go OPEX. It's faster, requires less initial investment and gives you access to more powerful models. There will be time to optimise cost later, with real data in hand.
This is nothing new. In any startup, until you have validated market fit, speed matters more than margin. Translated into AI, that means starting with cloud and frontier models via API, unless you have a direct constraint that pushes you to another option.
The reasons, in order:
- Time to market: in a few hours you have an agent working against the top models. With on-premise, you're talking weeks or months just to get the environment operational.
- Minimal initial investment: you pay per use, you don't buy GPUs. If the product doesn't work, you turn off the tap.
- Access to the best the market has to offer: the most capable models are only available via API. You can't buy them.
Once your product is validated, you have real traffic and, above all, you have data on how it is being used, you can start optimising cost: drop from frontier models to smaller ones, move part of the flow to open-source models running on-premise to bring down per-request cost, and reserve the big model for the cases that actually need it.
Remember: you don't always need the most powerful model. Most tasks in an agentic product — classify, extract, summarise, validate — are handled perfectly well by small, fast models. Use of the frontier model is the exception, not the rule.
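A minimal sketch of that rule as code, assuming a hypothetical two-tier catalogue. The tier names and the list of "hard" task types are placeholders I made up; in a real product they would come from your own usage data.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small-fast-model"    # placeholder name, not a real model id
    FRONTIER = "frontier-model"   # placeholder name, not a real model id


# Hypothetical task types that justify the expensive model.
FRONTIER_TASKS = {"multi_step_planning", "long_context_analysis"}


def pick_tier(task_type: str) -> Tier:
    """Default to the small model; escalate only for explicitly listed task types."""
    return Tier.FRONTIER if task_type in FRONTIER_TASKS else Tier.SMALL


def frontier_share(task_types: list[str]) -> float:
    """Fraction of a workload that ends up on the frontier model."""
    if not task_types:
        return 0.0
    hits = sum(1 for t in task_types if pick_tier(t) is Tier.FRONTIER)
    return hits / len(task_types)


workload = ["classify", "extract", "summarise", "validate", "multi_step_planning"]
print({t: pick_tier(t).value for t in workload})
print(f"Share routed to the frontier model: {frontier_share(workload):.0%}")
```

The design choice is the default: anything not explicitly flagged goes to the small model, so the frontier model stays the exception.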
Watch out with on-prem: it isn't only about investment strategy
There's a nuance that doesn't come up often in decision-makers' conversations and that can leave you exposed if you decide purely from a spreadsheet. On-prem isn't only an economic decision: it has a real technical ceiling.
Some tasks (particularly heavy multi-step reasoning, very large contexts and agents that coordinate several decisions) are done much better by models like the top releases from OpenAI or Anthropic. Neither vendor publishes full architecture details, so treat what follows as industry estimates rather than confirmed facts. OpenAI's GPT family is widely reported to use MoE (Mixture of Experts) architectures: models that internally hold many specialised sub-networks and only activate part of them for each request, with a total size estimated at close to a trillion parameters. Anthropic hasn't confirmed anything officially, but its flagship models are commonly assumed to follow a similar approach, with estimates also around 1T parameters.
Those frontier models don't exist on-premise. Their weights aren't downloadable and you can't rent them for your own cluster. They are the frontier, and they are what differentiates you when complexity rises.
That said, an important caveat: you don't always need a model that powerful. Miracles can be done with good AI engineering, deterministic logic and architecture; combining inference with traditional code remains the most powerful guardrail, as we saw in the first part. But some cases do require those models, and you'd better know how to spot them before committing to a 100% on-prem decision.
Beyond raw capability, there are two other on-prem limitations worth putting on the table before you decide: latency and languages.
On latency, the big cloud providers have spent years optimising inference with dedicated hardware, aggressive batching, shared KV cache and models quantised specifically for their GPUs. Matching that on-prem is possible, but it demands specialised staff and enough volume to amortise the investment. If your product lives or dies by response time, this matters.
On languages, open-source on-premise models are reasonably well tuned for the top-7 languages (English, Spanish, French, German, Portuguese, Italian, Mandarin) and lose quality quite noticeably outside them. If your product has to handle accents, dialects or less represented languages well, be careful: what works fine on a cloud API can lose quality or fall over entirely on-premise. The gap is even more acute in voice models than in text.
Voice agents: an even more demanding case
Move to voice and on-premise gets even harder. There are no production-grade voice models on-premise that match the latency of the well-known voice models from the major providers. Alternatives exist, but they drag an extra problem along: voice models aren't good at reasoning. Their priority is to answer you fast, hear you well, and interpret your prosody and paralinguistic cues.
To reach a decent level of intelligence with on-prem voice you have to add layers (a reasoning model that prepares the answer and a voice model that says it), and that shows up in latency, in coherence and in operational complexity.
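To see why the extra layers show up in latency, here is a small budget sketch for a layered voice pipeline. The stage names and millisecond figures are invented for illustration, as is the budget. Measure your own stack before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


def total_latency_ms(stages: list[Stage]) -> int:
    """Stages run one after another, so their latencies add up."""
    return sum(s.latency_ms for s in stages)


def over_budget_ms(stages: list[Stage], budget_ms: int) -> int:
    """How far the pipeline exceeds the budget (0 if it fits)."""
    return max(0, total_latency_ms(stages) - budget_ms)


# Hypothetical layered pipeline: hear, think, speak.
layered = [
    Stage("speech-to-text", 250),
    Stage("reasoning model, first token", 600),
    Stage("text-to-speech, first audio", 200),
    Stage("network and orchestration", 100),
]

BUDGET_MS = 800  # assumed target for a conversational feel
print(f"Total: {total_latency_ms(layered)} ms, over budget by {over_budget_ms(layered, BUDGET_MS)} ms")
```

Every layer you add to gain intelligence is a new line in this budget, which is why the architecture matters more than the server you buy.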
In voice, on-premise isn't only more expensive to run: it's slower, less intelligent and more fragile. And the user notices: a few hundred milliseconds change how the product feels.
But not everything is negative: I'm not saying it's impossible, just that it's a bigger engineering challenge. Don't assume that voice on-premise is like plugging in an API locally and paying for a good server. It needs a specific architecture. It can be done, but it's complex. We'll cover how to reach a solid level with on-prem voice models using advanced architectures in a future post.
LLM Gateway: the piece that governs all your LLMs reliably
Once you're clear on the catalogue of models you'll use and where they'll run, there's a piece still missing in many of the architectures I come across, and it's the difference between operating professionally and improvising as you go: an LLM Gateway.
A clear example is Portkey, whose gateway is open-source; similar products exist. What a layer like this gives you, translated into business impact:
- Virtual API keys: you manage keys per tenant, per client or per environment without touching the agent's code. If a client churns, a key is revoked; if you want to give more quota to another, a variable changes. Indispensable when you have multi-tenant or multi-account.
- Centralised guardrails: you apply input/output filters, refusal policies and schema validation in a single place, not on every agent. Change once, apply everywhere.
- Response cache: repeated identical calls are served from the cache instead of being sent to the provider and billed again. In B2B products with recurring queries, this can cut your token spend significantly.
- Model fallback: if your primary provider goes down or returns an error, the gateway redirects the request to a secondary one without touching the product's code. Your agent keeps working.
- Business logic at the layer: routing per user, per plan, per region, per query type. Decisions that would otherwise live scattered across each agent's code.
- Consolidated observability and cost: a single view of how much each client, each flow, each model costs you. This is strategic information, not technical.
- Rate limiting and priorities: you protect your product from spikes and abuse without having to rewrite the agent.
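To show how little the agent needs to know about all this, here is a toy gateway that combines three of the items above: virtual keys, response cache and model fallback. It is an illustrative sketch, not Portkey's API or any real product's. The class, the method names and the idea of modelling providers as plain functions are all assumptions made for the example.

```python
import hashlib
from typing import Callable

Provider = Callable[[str], str]  # takes a prompt, returns a completion


class ProviderError(Exception):
    """Raised by a provider function when the upstream call fails."""


class MiniGateway:
    def __init__(self, providers: list[tuple[str, Provider]],
                 virtual_keys: dict[str, str]) -> None:
        self.providers = providers          # ordered: primary first, then fallbacks
        self.virtual_keys = virtual_keys    # virtual key -> tenant id
        self.cache: dict[str, str] = {}
        self.calls_by_tenant: dict[str, int] = {}  # billed upstream calls per tenant

    def revoke(self, virtual_key: str) -> None:
        """A client churns: drop their key, no agent code changes."""
        self.virtual_keys.pop(virtual_key, None)

    def complete(self, virtual_key: str, prompt: str) -> str:
        tenant = self.virtual_keys.get(virtual_key)
        if tenant is None:
            raise PermissionError("unknown or revoked virtual key")

        # The cache is scoped per tenant so answers never cross clients.
        cache_key = hashlib.sha256(f"{tenant}:{prompt}".encode()).hexdigest()
        if cache_key in self.cache:
            return self.cache[cache_key]

        last_error: Exception | None = None
        for _name, provider in self.providers:
            try:
                answer = provider(prompt)
            except ProviderError as exc:
                last_error = exc            # try the next provider in the list
                continue
            self.cache[cache_key] = answer
            self.calls_by_tenant[tenant] = self.calls_by_tenant.get(tenant, 0) + 1
            return answer
        raise RuntimeError("all providers failed") from last_error
```

The agent only ever calls `complete` with its virtual key. Which provider answered, whether the answer came from cache, and how many billed calls each tenant has made are all decided and recorded in one place.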
A good LLM Gateway saves you money, keeps you out of trouble and, above all, makes your product operable without your technical team having to patch the system every time a model changes or a client demands something new.
If you're serious about an AI product, this layer isn't optional. It's what separates a product that scales with judgement from one that runs on heroics.
Idle Zero Cost: the hidden cost of isolation
We get to the point that surprises decision makers the most when I bring it up: the cost of agents that aren't doing anything.
The standard way to scale an AI SaaS is horizontal scaling on demand. One instance of the model or the agent serves N concurrent users and, when the load grows, more instances spin up. This is optimal from a cost perspective: a new instance is only created when the existing ones are near their capacity (with whatever margin you configure). The catch is that this gives you only virtual data isolation: several users' agents are loaded into the same memory and the same machine runs their requests, in sequence or in parallel. This is usually called logical isolation.
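The scaling rule described above fits in a few lines. This is a simplified sketch: the capacity per instance and the 80% target utilisation are assumed values, and real autoscalers add smoothing and cooldowns on top.

```python
import math


def instances_needed(concurrent_users: int, users_per_instance: int,
                     target_utilisation: float = 0.8, min_instances: int = 1) -> int:
    """Instances required so that none runs above the target utilisation.

    target_utilisation is the configured margin: 0.8 means a new instance
    is added once the existing ones reach 80% of their capacity.
    """
    if users_per_instance <= 0:
        raise ValueError("users_per_instance must be positive")
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    effective_capacity = users_per_instance * target_utilisation
    return max(min_instances, math.ceil(concurrent_users / effective_capacity))


# Hypothetical load levels against instances that hold 50 users each.
for users in (0, 30, 100, 400):
    print(users, "users ->", instances_needed(users, users_per_instance=50), "instance(s)")
```

Cost follows load here because instances are shared. That sharing is precisely what physical isolation takes away.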
But there are cases where you need physical isolation per client. When? When a high level of compliance kicks in. If your client requires that their data not be mixed with other clients' data, not even in memory, you'll have to deploy one instance of the agent per client, per tenant, and even per user in some verticals.
And here the hidden cost shows up: one client, one machine waiting to be used. The cost shoots up, even though most of those machines aren't processing any request.
It's not only about tokens. The architecture can blow up before a single user has spoken to your agent.
AI isn't cheap, and voice models are even less so. Multiplying dedicated instances by client, just like that, leads you to a fixed cost that grows linearly with the number of clients — regardless of whether they're using it.
The solution isn't new, but it is necessary: architectures that hibernate agents when they're not in use. Think of serverless-style patterns, or containers with native suspension: each agent only consumes resources when there's an active request. The rest of the time it's asleep.
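As a rough sketch of the hibernation idea, the pool below keeps one dedicated agent per tenant and puts it to sleep after an idle timeout. It only tracks state in memory. In a real deployment "wake" and "hibernate" would start and suspend a container or function, and the class and method names here are hypothetical.

```python
import time
from typing import Callable


class HibernatingAgentPool:
    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock                       # injectable so the logic can be tested
        self.last_used: dict[str, float] = {}    # tenant -> last request time; present = awake

    def handle(self, tenant: str, request: str) -> str:
        """Serve a request, waking the tenant's agent first if it was asleep."""
        was_asleep = tenant not in self.last_used
        self.last_used[tenant] = self.clock()
        state = "cold start" if was_asleep else "warm"
        return f"[{tenant}/{state}] handled: {request}"

    def sweep(self) -> list[str]:
        """Hibernate every agent idle for at least the timeout; return who went to sleep."""
        now = self.clock()
        idle = [t for t, ts in self.last_used.items() if now - ts >= self.idle_timeout_s]
        for tenant in idle:
            del self.last_used[tenant]
        return idle

    def awake(self) -> set[str]:
        """Tenants whose agents are currently consuming resources."""
        return set(self.last_used)
```

The isolation per tenant is preserved, since each tenant still gets its own agent. What changes is that only the agents in `awake()` are costing you money at any given moment, at the price of a cold start on the first request after a sleep.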
And here's the relevant figure for the business side: B2B agents used internally are, on average, used barely 10% of the day. If you pay for 100%, you pay ten times what you should.
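The arithmetic behind that figure, as a quick sketch. The hourly rate and the number of clients are assumptions for the example, and the hibernated case is simplified: it ignores cold-start overhead and the small storage cost of a sleeping agent.

```python
HOURS_PER_MONTH = 730.0  # average month


def monthly_cost_always_on(clients: int, hourly_rate: float) -> float:
    """One dedicated instance per client, running around the clock."""
    return clients * hourly_rate * HOURS_PER_MONTH


def monthly_cost_hibernated(clients: int, hourly_rate: float, utilisation: float) -> float:
    """Same dedicated instances, billed only while they serve requests."""
    return monthly_cost_always_on(clients, hourly_rate) * utilisation


def overpay_factor(utilisation: float) -> float:
    """How many times more you pay by keeping idle instances awake."""
    return 1.0 / utilisation


# Hypothetical: 50 isolated clients, 1.00 per instance-hour, 10% utilisation.
always_on = monthly_cost_always_on(50, 1.00)
hibernated = monthly_cost_hibernated(50, 1.00, 0.10)
print(f"Always on: {always_on:,.0f}  Hibernated: {hibernated:,.0f}  "
      f"Factor: {overpay_factor(0.10):.0f}x")
```

Note that the always-on line grows with the number of clients even if none of them sends a single request. That is the fixed cost the section is warning about.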
Idle Zero Cost leaves more margin to build more efficient products, pass less cost to the client and get better profit without giving up the isolation that compliance requires.
There's plenty to say here — concrete patterns, deployment options, how to choose between serverless and managed hibernation — and we'll see it in detail in a future post.
What's coming in Part 3 (and series close)
In the next part we'll close this series with two topics that didn't fit due to length and that deserve their own space:
- Synthetic data and measurable use cases before shipping to production.
- Compliance tiers and, above all, model compliance myths. What's the real risk of data leakage when you use a model? What exactly happens to your data when it reaches a provider? In which cases do models get trained on it, and in which don't they?
Once you understand how an AI model works and how it's trained, many of those fears come apart on their own. And the ones that remain, you'll be able to address with judgement in front of the committee.
See you in the next post.


