AI Products Are Not Prompts — A Guide for C-Levels and Decision Makers (Part 2)

Liminal

by Jonathan Castro


#AI Strategy #AI Systems #Liminal

[Figure: chart with axes "Business Value" and "Technical Complexity"]

After the first post in this series — where we saw that an AI product is not a prompt, but an architecture of several layers (Interface, Routing, Reasoning, Tools, Memory, Observability) — it's time to take it to the next level.

And as promised, in this second part we get into the determining factors a decision maker should put on the table before kicking off an AI project, among others:

Investment decisions: CAPEX vs OPEX applied to AI — the most prominent one.
Supported languages: dialects and non-majority languages lose quality fast outside the top-7.
Target latency: not every agent needs real time; some run perfectly well in batch.
Model catalogue: which ones you pick, how many you have, and how you manage them professionally with an LLM Gateway.
Idle Zero Cost for deployed agents: how to stop your architecture from bleeding while the agents aren't serving anyone.

The goal: that you reach the committee equipped to decide your product's investment strategy with AI concepts properly applied, to read your AI product's P&L correctly, and to pick the operational pieces that keep cost from blowing up when you start scaling.

Let's get into it.

Investment strategy for the project applied to AI

Before we talk about models, latency or architecture, there's a classic business decision that still rules: do you want to invest more up front and less in operations (weighting towards CAPEX), or the opposite (weighting towards OPEX)? As in any business, it depends. But AI brings new factors into this decision.

What always applies and never changes: it depends on how much you'll use the product, how validated it is, what initial risk you're willing to take and at what scale you expect to operate. The answer can be pure on-premise, pure cloud or a hybrid. But the first question you have to answer for yourself is whether your immediate priority is preserving cash and speed, or making unit cost predictable and bringing it down in the long run.

If you haven't validated market fit yet, go OPEX. It's faster, requires less initial investment and gives you access to more powerful models. There will be time to optimise cost later, with real data in hand.

This is nothing new. In any startup, until market fit is validated, speed matters more than margin. Translated into AI, that means starting with cloud and frontier models via API, unless a direct constraint pushes you to another option.

The reasons, in order:

Time to market: in a few hours you have an agent working against the top models. With on-premise, you're talking weeks or months just to get the environment operational.
Minimal initial investment: you pay per use, you don't buy GPUs. If the product doesn't work, you turn off the tap.
Access to the best the market has to offer: the most capable models are only available via API. You can't buy them.
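
The pay-per-use argument can be made concrete with a break-even calculation. What follows is a minimal sketch, not a pricing model: the function name and every figure in it are hypothetical placeholders, so plug in your own API price, on-premise fixed cost and on-premise marginal cost.

```python
import math


def breakeven_requests(api_cost_per_request: float,
                       onprem_fixed_monthly: float,
                       onprem_cost_per_request: float) -> float:
    """Monthly request volume above which on-premise becomes cheaper than the API.

    Returns math.inf when on-premise never catches up, i.e. when its
    marginal cost per request is not lower than the API price.
    """
    saving_per_request = api_cost_per_request - onprem_cost_per_request
    if saving_per_request <= 0:
        return math.inf
    return onprem_fixed_monthly / saving_per_request


# Hypothetical placeholder figures, for illustration only.
volume = breakeven_requests(
    api_cost_per_request=0.02,      # assumed average API cost per request
    onprem_fixed_monthly=10_000.0,  # assumed amortised hardware and staff per month
    onprem_cost_per_request=0.004,  # assumed energy and operations cost per request
)
print(f"Break-even at about {volume:,.0f} requests per month")
```

Below that volume the API is the cheaper option, which is why an unvalidated product with little traffic rarely justifies buying GPUs.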

Once your product is validated, you have real traffic and, above all, you have data on how your product is being used, you can start optimising cost: drop from frontier models to narrower ones, move part of the flow to open-source models running on-premise to bring down per-request cost, and reserve the big model for the cases that actually need it.

Remember: you don't always need the most powerful model. Most tasks in an agentic product — classify, extract, summarise, validate — are handled perfectly well by small, fast models. Use of the frontier model is the exception, not the rule.
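
As a rough illustration of "the frontier model is the exception", here is a minimal routing sketch. The model names, the `Request` type and the context threshold are assumptions made up for this example; a real router would use your own catalogue and your own usage data.

```python
from dataclasses import dataclass

# Hypothetical model tier names; replace with the models in your own catalogue.
SMALL_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

# Task types the article lists as well served by small, fast models.
ROUTINE_TASKS = {"classify", "extract", "summarise", "validate"}


@dataclass
class Request:
    task: str            # e.g. "classify" or "plan"
    context_tokens: int  # size of the prompt context


def pick_model(request: Request, large_context_threshold: int = 50_000) -> str:
    """Send routine work to the small model and escalate only when needed."""
    is_routine = request.task in ROUTINE_TASKS
    fits_small_model = request.context_tokens < large_context_threshold
    if is_routine and fits_small_model:
        return SMALL_MODEL
    return FRONTIER_MODEL


print(pick_model(Request(task="classify", context_tokens=1_200)))
print(pick_model(Request(task="plan", context_tokens=1_200)))
```

The point of the design is that the default path is the cheap one; the expensive model has to be justified by the task type or the context size.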

Watch out with on-prem: it isn't only about investment strategy

There's a nuance that doesn't come up often in decision-makers' conversations and that can leave you exposed if you decide purely from a spreadsheet. On-prem isn't only an economic decision: it has a real technical ceiling.

Some tasks, particularly heavy reasoning with many steps, very large contexts and agents that coordinate several decisions, are done much better by models like the top releases from OpenAI or Anthropic. In the case of OpenAI and its flagship GPT models, the widely reported picture is an MoE (Mixture of Experts) architecture: models that internally hold many specialised sub-networks (experts) and activate only some of them for each request. OpenAI doesn't publish parameter counts for these models; outside estimates put their total capacity close to a trillion parameters. Anthropic hasn't published architecture details either; the indications are that its flagship models don't use MoE, and they are also estimated at roughly 1T parameters.

That kind of model doesn't exist on-premise. It isn't downloadable. You can't rent it for your cluster. It is the frontier, and it is what differentiates you when complexity rises.

That said, important: you don't always need a model that powerful. Miracles can be done with good AI engineering, deterministic logic and architecture — combining inference with traditional code remains the most powerful guardrail, as we saw in the first part. But there are cases that do require those models, and you'd better know how to spot them before committing to a 100% on-prem decision.

Beyond raw capability, there are two other on-prem limitations worth putting on the table before you decide: latency and languages.

On latency, the big cloud providers have spent years optimising inference with dedicated hardware, aggressive batching, shared KV cache and models quantised specifically for their GPUs. Matching that on-prem is possible, but it demands specialised staff and enough volume to amortise the investment. If your product lives or dies by response time, this matters.

On languages, open-source on-premise models are reasonably well tuned in the world's top-7 (English, Spanish, French, German, Portuguese, Italian, Mandarin) and lose quality quite noticeably outside that. If your product has to handle accents, dialects or less represented languages well, be careful: what works fine on a cloud API can fall over or lose quality on-premise. The problem is even more acute in voice models than in text.

Voice agents: an even more demanding case

Move to voice and on-premise gets even harder. There are no production-grade voice models on-premise that match the latency of the well-known voice models from the major providers. Alternatives exist, but they drag an extra problem along: voice models aren't good at reasoning. Their priority is to answer you fast, hear you well, and interpret your prosody and paralinguistic cues.

To reach a decent intelligence index with on-prem voice you have to add layers (a reasoning model that prepares the answer and a voice model that speaks it), and that shows up in latency, in coherence and in operational complexity.
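
To see why layering shows up in latency, it helps to add up the stages. The sketch below assumes a three-stage cascade and uses hypothetical per-stage figures that are placeholders, not measurements; the stage names are assumptions too.

```python
# Hypothetical per-stage latencies in milliseconds; measure your own pipeline.
cascaded_pipeline_ms = {
    "voice_model_listening": 150,  # turning the user's audio into text
    "reasoning_model": 400,        # preparing the answer
    "voice_model_speaking": 200,   # turning the answer into audio
}


def total_latency_ms(stages: dict[str, int]) -> int:
    """Stages run one after another, so their latencies add up."""
    return sum(stages.values())


def over_budget(stages: dict[str, int], budget_ms: int) -> bool:
    """True when the summed pipeline latency exceeds the target budget."""
    return total_latency_ms(stages) > budget_ms


print(total_latency_ms(cascaded_pipeline_ms), "ms end to end")
print("over a 500 ms budget:", over_budget(cascaded_pipeline_ms, budget_ms=500))
```

Every layer you add to recover reasoning quality is another term in that sum, which is the trade-off to keep in view.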

In voice, on-premise isn't only more expensive to run: it's slower, less intelligent and more fragile. And the user notices: a few hundred milliseconds change how the product feels.

But wait, it's not all bad news: I'm not saying it's impossible, just a bigger engineering challenge. Don't assume that voice on-premise is like plugging in an API locally and paying for a good server. It needs a specific architecture. Anything is possible, but it's complex. A future post will cover how to reach a solid level with on-prem voice models using advanced architectures.

LLM Gateway: the piece that governs all your LLMs reliably

Once you're clear on the catalogue of models you'll use and where they'll run, there's a piece still missing in many of the architectures I come across, and it's the difference between operating professionally and improvising as you go: an LLM Gateway.

A clear example is Portkey, whose gateway is open-source; similar products exist. What a layer like this gives you, translated into business impact:

Virtual API keys: you manage keys per tenant, per client or per environment without touching the agent's code. If a client churns, a key is revoked; if you want to give more quota to another, a variable changes. Indispensable when you have multi-tenant or multi-account.
Centralised guardrails: you apply input/output filters, refusal policies and schema validation in a single place, not on every agent. Change once, apply everywhere.
Token cache: responses to repeated calls are served from the gateway's cache, so the provider doesn't bill those tokens again. In B2B products with recurring queries, this can cut your token spend substantially.
Model fallback: if your primary provider goes down or returns an error, the gateway redirects the request to a secondary one without touching the product's code. Your agent keeps working.
Business logic at the layer: routing per user, per plan, per region, per query type. Decisions that would otherwise live scattered across each agent's code.
Consolidated observability and cost: a single view of how much each client, each flow, each model costs you. This is strategic information, not technical.
Rate limiting and priorities: you protect your product from spikes and abuse without having to rewrite the agent.
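
Two of the capabilities listed above, caching and model fallback, are easy to picture in a few lines. This is a toy sketch under stated assumptions, not Portkey's API or any real gateway's implementation: `MiniGateway` and the two stub providers are hypothetical names invented for the example.

```python
import hashlib
from typing import Callable, Optional

# A provider is any function that takes a prompt and returns a completion.
Provider = Callable[[str], str]


class MiniGateway:
    """Toy gateway: a response cache in front of an ordered list of providers."""

    def __init__(self, providers: list[Provider]):
        self.providers = providers         # primary first, then fallbacks
        self.cache: dict[str, str] = {}    # prompt hash -> cached response

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:              # repeated call: no provider is hit
            return self.cache[key]
        last_error: Optional[Exception] = None
        for provider in self.providers:
            try:
                response = provider(prompt)
            except Exception as error:     # provider down or returned an error
                last_error = error
                continue                   # try the next provider in the list
            self.cache[key] = response
            return response
        raise RuntimeError("all providers failed") from last_error


# Stub providers standing in for real model APIs.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider unavailable")


def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


gateway = MiniGateway([flaky_primary, secondary])
first = gateway.complete("What is our refund policy?")   # served by the fallback
second = gateway.complete("What is our refund policy?")  # served from the cache
print(first)
```

The agent only ever calls `complete`; which provider answered, and whether a provider was called at all, is decided in one place.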

A good LLM Gateway saves you money, keeps you out of trouble and, above all, makes your product operable without your technical team having to patch the system every time a model changes or a client demands something new.

If you're serious about an AI product, this layer isn't optional. It's what separates a product that scales with judgement from one that runs on heroics.

Idle Zero Cost: the hidden cost of isolation

We get to the point that surprises decision makers the most when I bring it up: the cost of agents that aren't doing anything.

The standard way to scale an AI SaaS is horizontal scaling on demand. One instance of the model or the agent serves N concurrent users and, when the load grows, more instances spin up. This is optimal from a cost perspective: a new instance is only created when the existing ones are near their capacity (with whatever margin you configure). The catch is that this gives you virtual data isolation: each user's agents are loaded into memory and the same machine runs several requests, in sequence or in parallel. This is usually called logical isolation.
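
The scaling rule described here comes down to one line of arithmetic. A minimal sketch follows, assuming capacity is measured in concurrent users per instance and the margin is a fraction of that capacity; the function name and the numbers are hypothetical.

```python
import math


def instances_needed(concurrent_users: int,
                     users_per_instance: int,
                     headroom: float = 0.2) -> int:
    """Instances required so each one runs at most (1 - headroom) of its capacity."""
    if concurrent_users <= 0:
        return 0
    usable_capacity = users_per_instance * (1.0 - headroom)
    return math.ceil(concurrent_users / usable_capacity)


# Hypothetical load levels against instances that each hold 50 concurrent users.
for load in (10, 100, 450):
    print(load, "users ->", instances_needed(load, users_per_instance=50), "instances")
```

Cost follows load here, not the number of clients, which is exactly what is lost once each client needs its own instance.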

But there are cases where you need physical isolation per client. When? When a high level of compliance kicks in. If your client requires that their data not be mixed with other clients' data, not even in memory, you'll have to deploy one instance of the agent per client, per tenant, and even per user in some verticals.

And here the hidden cost shows up: every client means a machine waiting to be used. The cost shoots up, even though most of those machines aren't processing any request.

It's not only about tokens. The architecture can blow up before a single user has spoken to your agent.

AI isn't cheap, and voice models are even less so. Multiplying dedicated instances by client, just like that, leads you to a fixed cost that grows linearly with the number of clients — regardless of whether they're using it.

The solution isn't new, but it is necessary: architectures that hibernate agents when they're not in use. Think of serverless-style patterns, or containers with native suspension: each agent only consumes resources when there's an active request. The rest of the time it's asleep.
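
The hibernation idea can be sketched as a pool that keeps one dedicated agent per tenant only while it is in use. This is a simplified illustration: `HibernatingAgentPool` is a hypothetical name, and deleting a dictionary entry stands in for actually suspending a container or instance.

```python
import time


class HibernatingAgentPool:
    """Keeps one dedicated agent per tenant, but only while it is in use."""

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock                     # injectable so it can be tested
        self.last_used: dict[str, float] = {}  # tenant -> time of last request

    def handle_request(self, tenant: str) -> str:
        """Wake the tenant's agent if it was asleep, then record the activity."""
        state = "warm" if tenant in self.last_used else "woken"
        self.last_used[tenant] = self.clock()
        return state

    def hibernate_idle(self) -> list[str]:
        """Suspend every agent that has been idle longer than the timeout."""
        now = self.clock()
        idle = [tenant for tenant, seen in self.last_used.items()
                if now - seen > self.idle_timeout_s]
        for tenant in idle:
            del self.last_used[tenant]         # stand-in for suspending the instance
        return idle


pool = HibernatingAgentPool(idle_timeout_s=300)
print(pool.handle_request("client-a"))  # first request wakes the agent
print(pool.handle_request("client-a"))  # second request finds it warm
```

Isolation is preserved, since each tenant still gets its own agent; what changes is that an agent nobody is talking to stops consuming resources. The price is a wake-up delay on the first request after a sleep.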

And here's the relevant figure for the business side: B2B agents used internally are, on average, used barely 10% of the day. If you pay for 100%, you pay ten times what you should.
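
The "ten times" figure is easy to check. The sketch below assumes one dedicated instance per client, a 730-hour month and a hypothetical hourly rate; it also ignores wake-up overhead, so treat the result as an upper bound on the saving.

```python
def monthly_cost(clients: int, hourly_rate: float,
                 utilisation: float, hibernates: bool) -> float:
    """Monthly cost of one dedicated instance per client.

    Always-on instances are billed for every hour; hibernating ones only
    for the fraction of time they are actually serving requests.
    """
    hours_per_month = 730
    billed_fraction = utilisation if hibernates else 1.0
    return clients * hourly_rate * hours_per_month * billed_fraction


# Hypothetical figures: 50 clients, 1.00 per instance-hour, 10% utilisation.
always_on = monthly_cost(50, 1.0, utilisation=0.10, hibernates=False)
hibernating = monthly_cost(50, 1.0, utilisation=0.10, hibernates=True)
print(f"always on: {always_on:,.0f}  hibernating: {hibernating:,.0f}")
print(f"ratio: {always_on / hibernating:.1f}x")
```

Both costs still grow linearly with the number of clients; hibernation changes the slope, not the shape.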

Idle Zero Cost leaves more margin to build more efficient products, pass less cost to the client and get better profit without giving up the isolation that compliance requires.

There's plenty to say here (concrete patterns, deployment options, how to choose between serverless and managed hibernation), and a future post will cover it in detail.

What's coming in Part 3 (and series close)

In the next part we'll close this series with two topics that didn't fit due to length and that deserve their own space:

Synthetic data and measurable use cases before shipping to production.
Compliance tiers and, above all, model compliance myths. What's the real risk of data leakage when you use a model? What exactly happens to your data when it reaches a provider? In which cases do models get trained on it, and in which don't they?

Once you understand how an AI model works and how it's trained, many of those fears come apart on their own. And the ones that remain, you'll be able to address with sound judgement at the committee.

See you in the next post.

