Executive summary
As large language models scale across the enterprise, the primary constraints are no longer model intelligence but token cost, inference latency, reliability, and sustainability. This paper introduces token economics as the governing framework for scalable, cost-optimal LLM systems, arguing that tokens—not parameters—are the true unit of cost and risk.
The central insight is that token inefficiency is largely locked in long before inference, through decisions around training data quality, language distribution, and domain alignment. Poor data quality and imbalance directly translate into longer prompts, verbose outputs, higher hallucination rates, and increased reliance on guardrails—multiplying downstream inference spend.
The paper further shows that architecture and scaling choices are economic decisions, not purely technical ones. While transformers improved reasoning per token, sequential decoding makes output length the dominant cost driver. Larger models do not guarantee better outcomes; sparse architectures, compute-optimal training, and task-aligned models often deliver superior cost-normalized performance.
At inference scale, the paper demonstrates why multi-model routing and speculative decoding are critical system levers. By allocating routine requests to smaller models and escalating only when complexity demands it, organizations can dramatically reduce average cost and tail latency without compromising quality.
The conclusion is clear: AI advantage will be defined by system design, not model size. Organizations that treat tokens as a finite economic resource and orchestrate intelligence across model portfolios will achieve durable gains in scalability, cost efficiency, and reliability.
Token economics refers to how tokens are produced, consumed, and wasted across training, inference, and orchestration—and how those flows determine cost, latency, reliability, and scalability.
Training data
Training data is the first token optimization decision
Most discussions on token optimization focus on prompts, truncation, or inference-time tricks. In reality, the largest and most irreversible token decision is made far earlier: in training data selection and composition. Training data does not merely shape model capability; it directly determines how many tokens are required to achieve acceptable accuracy, safety, and consistency at inference time.
Foundation models are often treated as neutral, general-purpose engines. In practice, they are statistical reflections of the data they consume—its languages, domains, structure, and noise. Every inefficiency or bias in training data compounds downstream as longer prompts, higher token counts, additional guardrails, or fallback systems.
Quantity does not equal efficiency
A common misconception is that more data invariably produces better models. In practice, smaller volumes of high-quality, task-aligned data often outperform much larger volumes of low-quality web crawl data.
From a token-economics perspective:
- Models trained on noisy or weakly relevant data require longer prompts to disambiguate intent.
- They rely more heavily on few-shot examples, increasing input token cost.
- They produce longer outputs to hedge uncertainty, increasing output token cost.
- They exhibit higher variance, forcing downstream retries or validation passes.
In contrast, models trained on curated, high-signal data demonstrate tighter probability distributions. This translates to shorter prompts, lower temperature requirements, fewer corrective instructions, and more compact outputs.
The business implication is clear: training data quality is a first-order lever on inference spend.
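The effect of prompt length, few-shot examples, output length, and retries on spend can be made concrete with a small cost model. The sketch below is illustrative only: the `RequestProfile` fields, the two example profiles, and the prices are hypothetical values, and retries are assumed to be independent.

```python
from dataclasses import dataclass


@dataclass
class RequestProfile:
    prompt_tokens: int    # instructions plus user input
    few_shot_tokens: int  # examples added to steer the model
    output_tokens: int    # generated tokens
    retry_rate: float     # probability an attempt fails validation, in [0, 1)


def expected_cost(profile, price_in_per_1k, price_out_per_1k):
    """Expected spend per successful request, assuming independent retries."""
    per_attempt = (
        (profile.prompt_tokens + profile.few_shot_tokens) / 1000 * price_in_per_1k
        + profile.output_tokens / 1000 * price_out_per_1k
    )
    # With independent failures, the expected number of attempts is 1 / (1 - retry_rate).
    expected_attempts = 1.0 / (1.0 - profile.retry_rate)
    return per_attempt * expected_attempts


# Hypothetical profiles: a model that needs scaffolding vs. one that does not.
noisy = RequestProfile(prompt_tokens=400, few_shot_tokens=1200, output_tokens=600, retry_rate=0.2)
curated = RequestProfile(prompt_tokens=200, few_shot_tokens=0, output_tokens=250, retry_rate=0.05)

PRICE_IN, PRICE_OUT = 0.5, 1.5  # hypothetical prices per 1,000 tokens

noisy_cost = expected_cost(noisy, PRICE_IN, PRICE_OUT)
curated_cost = expected_cost(curated, PRICE_IN, PRICE_OUT)
print(f"noisy={noisy_cost:.3f} curated={curated_cost:.3f} ratio={noisy_cost / curated_cost:.2f}x")
```

The point of the model is the structure rather than the numbers: every extra input token, output token, and retry multiplies through to the per-request cost.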
Language distribution drives token cost and latency
Training data is overwhelmingly dominated by English text. This imbalance has two direct operational consequences:
- Performance asymmetry
  Models reason, follow instructions, and solve problems more accurately in English than in under-represented languages. This forces teams to compensate with:
  - Prompt amplification
  - Translation layers
  - Multi-pass reasoning
- Token inefficiency across languages
  Tokenization efficiency varies dramatically by language. For the same semantic content:
  - English may require single-digit tokens
  - Some Indic and Southeast Asian languages require 5–10× more tokens
Since inference latency and cost scale at least linearly with token count, this creates a structural disadvantage:
- Slower responses
- Higher per-request cost
- Lower throughput under the same infrastructure
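A minimal sketch of how a tokenization multiplier flows through to cost, response time, and throughput, assuming purely linear scaling. The base token count, price, and decode speed are hypothetical placeholders, not measurements.

```python
def per_request(base_tokens, token_multiplier, price_per_1k, tokens_per_second):
    """Tokens, cost, and generation time for the same content under a tokenization multiplier."""
    tokens = base_tokens * token_multiplier
    cost = tokens / 1000 * price_per_1k
    seconds = tokens / tokens_per_second
    return tokens, cost, seconds


BASE_TOKENS = 200         # hypothetical token count for the English version
PRICE_PER_1K = 1.0        # hypothetical price per 1,000 tokens
TOKENS_PER_SECOND = 50.0  # hypothetical generation speed

for label, multiplier in [("English", 1), ("5x language", 5), ("10x language", 10)]:
    tokens, cost, seconds = per_request(BASE_TOKENS, multiplier, PRICE_PER_1K, TOKENS_PER_SECOND)
    # Throughput for a single sequential stream on the same hardware.
    requests_per_minute = 60 / seconds
    print(f"{label:>13}: tokens={tokens} cost={cost:.2f} time={seconds:.0f}s rpm={requests_per_minute:.1f}")
```

Under these assumptions, a 5× token multiplier means 5× the cost and time per request and one fifth of the throughput, with no change in the content delivered.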
From a leadership standpoint, this means global scalability is not just a localization problem—it is a token economics problem rooted in training data choices.
Translation is not a free optimization
A tempting workaround is to translate all non-English inputs into English, process them, and translate outputs back. While appealing on paper, this introduces hidden costs:
- Information loss, especially in languages where grammar encodes social hierarchy, formality, or relational context.
- Additional token overhead from translation steps.
- Error compounding, where mistranslation degrades reasoning before the model even begins inference.
In practice, translation pipelines often increase total token consumption while reducing semantic fidelity.
Data quality is a safety and hallucination control mechanism
Training data quality also governs how confidently a model answers questions it should not. Models trained on noisy or misleading sources:
- Hallucinate more aggressively
- Require longer safety instructions
- Depend on post-generation filtering
Each of these mechanisms consumes tokens and compute.
Conversely, high-quality, well-curated datasets produce models that:
- Say “I don’t know” earlier
- Require fewer safety constraints
- Generate shorter, more precise answers
This insight is underappreciated: hallucination mitigation is not only an alignment problem but also a data curation problem with direct cost implications.
Domain-specific data is a token multiplier
General-purpose models appear versatile, but they are inherently inefficient for specialized tasks. When a model has not seen a domain during training:
- Prompts become verbose
- Context windows are saturated with background explanations
- Outputs are longer and less precise
Domain-specific training or fine-tuning compresses this inefficiency. A domain-aware model:
- Requires less context to understand intent
- Produces more structured outputs
- Achieves higher accuracy with fewer tokens
From a system design lens, domain specialization is one of the highest-ROI token optimization strategies available.
Model architecture
Model architecture is a token economics decision
If training data determines what a model knows, architecture determines how expensively that knowledge is accessed. Most organizations discuss architecture in terms of accuracy or scale. In production systems, the more consequential question is simpler:
How many tokens—and how much time—does it take for the model to arrive at a useful answer?
Why transformers won—and why that matters for tokens
The dominance of the transformer architecture is not accidental. It solved two structural inefficiencies of earlier sequence models:
- Information bottleneck
  Early RNN-based seq2seq models compressed the entire input into a single hidden state. This forced models to “guess” based on summaries rather than evidence, increasing hallucination risk and output verbosity.
- Sequential processing
  RNN-based models processed tokens one at a time, making long inputs slow and expensive to consume.
Transformers replaced both with attention, allowing each output token to directly reference all prior tokens. This dramatically improved answer quality per token generated, reducing the need for long explanations or defensive verbosity.
In other words, attention was not just a modeling breakthrough—it was a token efficiency breakthrough.
The hidden cost of autoregression
While transformers removed sequential bottlenecks on the input side, they retained one critical limitation:
Decoding remains sequential.
Every generated token depends on the previous one. This creates two inference phases with very different cost profiles:
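- Prefill: Parallel and compute-bound; the whole input is processed in one pass, so cost grows with input length
- Decode: Sequential and memory-bandwidth-bound; one pass per generated token, so cost is proportional to output length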
This has profound implications:
- Short, precise outputs scale well
- Long, verbose outputs scale poorly
- Any architectural or prompt decision that increases output length multiplies latency and cost
This is why token optimization efforts disproportionately focus on output control, not just input compression.
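The two-phase cost profile can be approximated with a simple latency model. This is a sketch under stated assumptions: the throughput figures are hypothetical, and real systems add batching, queueing, and scheduling effects that are ignored here.

```python
def estimate_latency(input_tokens, output_tokens, prefill_tokens_per_s, decode_tokens_per_s):
    """Return (prefill_seconds, decode_seconds) for one request."""
    # Prefill processes the whole prompt in parallel, so its rate is high.
    prefill_s = input_tokens / prefill_tokens_per_s
    # Decode emits one token per step, so time grows with every output token.
    decode_s = output_tokens / decode_tokens_per_s
    return prefill_s, decode_s


PREFILL_RATE = 5000.0  # hypothetical tokens/second during prefill
DECODE_RATE = 50.0     # hypothetical tokens/second during decode

for output_tokens in (100, 400, 800):
    prefill_s, decode_s = estimate_latency(2000, output_tokens, PREFILL_RATE, DECODE_RATE)
    share = decode_s / (prefill_s + decode_s)
    print(f"output={output_tokens}: prefill={prefill_s:.1f}s decode={decode_s:.1f}s decode_share={share:.0%}")
```

With these placeholder rates, a 2,000-token prompt costs a fraction of a second, while each additional 100 output tokens adds two seconds, which is why output length dominates.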
Attention is powerful—and expensive
Attention works by computing interactions between tokens using query, key, and value vectors. While this enables rich reasoning, it introduces two scaling constraints:
- Memory growth
  Each token adds key and value vectors that must be stored and reused. Longer contexts mean higher memory pressure.
- Quadratic compute
  The compute for full attention grows with the square of sequence length, making naïvely increasing context windows economically unsustainable.
This is why context length is not “free,” even when models technically support it. Large context windows often:
- Increase latency
- Reduce throughput
- Inflate infrastructure costs
- Encourage prompt bloat instead of prompt discipline
From a leadership perspective, long context is a strategic resource, not a default setting.
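Both scaling constraints can be estimated with back-of-the-envelope arithmetic. The sketch below uses a hypothetical model configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) and counts only key/value cache storage and pairwise attention scores, ignoring all other memory and compute.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value, batch_size=1):
    """Memory needed to store keys and values for every token in the context."""
    # Factor 2: one key vector and one value vector per token, per layer, per head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch_size


def attention_pairs(seq_len):
    """Number of token-to-token scores in full (unmasked) attention, per head and layer."""
    return seq_len * seq_len


# Hypothetical configuration, chosen only to make the arithmetic concrete.
CONFIG = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

for seq_len in (4096, 32768, 131072):
    gib = kv_cache_bytes(seq_len, **CONFIG) / 1024**3
    print(f"context={seq_len}: kv_cache={gib:.0f} GiB per sequence, score pairs={attention_pairs(seq_len):.2e}")
```

Memory grows linearly with context length while pairwise scores grow quadratically, so an 8× longer context needs 8× the cache and 64× the attention scores under these assumptions.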
Multi-head attention: Precision over redundancy
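Multi-head attention allows models to attend to different patterns simultaneously. This improves reasoning density, but when heads are added by widening the model rather than by splitting a fixed width into more heads, it also increases: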
- Parameter count
- Intermediate activations
- Memory traffic during inference
Well-designed models strike a balance: enough heads to capture structure, but not so many that additional heads yield diminishing returns per token.
This is one reason smaller, well-tuned models often outperform larger ones on cost-normalized benchmarks.
Model size: Capacity versus efficiency
Increasing parameter count increases capacity, but efficiency does not increase in proportion.
In practice:
- Larger models tend to produce longer outputs
- They are more likely to “explain themselves”
- They incur higher decode latency even for simple tasks
For many enterprise workloads, the optimal model is not the largest available one, but the smallest model that:
- Requires minimal prompt scaffolding
- Produces compact, high-confidence outputs
- Avoids retry loops and guardrail triggers
This is why production systems increasingly rely on model portfolios, not single monoliths.
Beyond transformers: Architecture as a cost lever
Emerging architectures such as state space models and hybrid designs are not academic curiosities—they are responses to transformer inefficiencies:
- Linear or near-linear scaling with sequence length
- Lower memory footprints
- Better long-context behavior without quadratic attention costs
Hybrid models that selectively use attention where it matters and cheaper mechanisms elsewhere signal an important shift:
Future token optimization will come as much from architecture as from prompting.
Scaling is a budget allocation problem, not a model size problem
The industry narrative around LLMs has long equated progress with larger models. In practice, scaling is constrained by three finite resources: tokens, compute, and energy. Modern LLM optimization is therefore less about “how big can we go” and more about how efficiently we convert spend into capability.
Parameters are a proxy, not the cost
Parameter count is often treated as the headline metric of a model’s scale. While useful, it is an incomplete signal.
- Parameters estimate memory footprint
- They loosely correlate with learning capacity
- They do not directly represent inference cost
Sparse architectures make this explicit. A model may contain tens of billions of parameters, yet only activate a fraction of them per token. In such systems, active parameters per token, not total parameters, determine latency and cost.
This distinction matters because it breaks a common assumption:
a larger model can be cheaper to run than a smaller one if sparsity is used effectively.
Mixture-of-experts: Token-efficient scale
Mixture-of-experts (MoE) architectures operationalize this idea. Instead of invoking the full model for every token, a router selectively activates a small subset of experts for each token.
The consequence is subtle but profound:
- Capacity scales with total parameters
- Cost scales with active parameters per token
From a token-economics perspective, MoE allows organizations to buy intelligence in bulk while paying per-use, much like cloud compute itself.
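The gap between total and active parameters is simple arithmetic. The sketch below is a deliberately simplified view that treats the model as one shared block plus a pool of equally sized experts; the parameter counts and the `top_k` value are hypothetical.

```python
def moe_parameters(shared_params, n_experts, params_per_expert, top_k):
    """Return (total_params, active_params_per_token) for a simplified MoE model."""
    total = shared_params + n_experts * params_per_expert
    # Only the shared layers and the top_k routed experts run for each token.
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical configuration: 2B shared parameters, 8 experts of 5B each, 2 experts per token.
total, active = moe_parameters(
    shared_params=2_000_000_000,
    n_experts=8,
    params_per_expert=5_000_000_000,
    top_k=2,
)
print(f"total={total / 1e9:.0f}B active={active / 1e9:.0f}B active_share={active / total:.0%}")
```

In this configuration the model stores 42B parameters but computes with 12B per token, which is why such a model can be cheaper to run than a dense model with fewer total parameters.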
However, MoE introduces operational complexity:
- Routing instability
- Load imbalance across experts
- Harder latency guarantees
This reinforces a recurring theme: token efficiency gains often trade off against system simplicity.
Tokens are the true unit of learning
Counting training samples is no longer meaningful. A sentence, a document, and a book are not equivalent learning events. Tokens are.
Tokens matter because:
- They are the unit the model optimizes against
- They determine how much statistical signal the model can absorb
- They correlate more directly with downstream performance than raw dataset size
At scale, this leads to a critical distinction:
- Dataset tokens: the size of the corpus
- Training tokens: dataset tokens × number of epochs
Two models trained on the same dataset can have radically different capabilities depending on how many times they see the data.
Compute-optimal training: The end of guesswork
For years, training large models resembled informed experimentation. Scaling laws changed this.
Given a fixed compute budget, there exists an optimal balance between model size and number of training tokens. Training a model that is too large on too little data wastes parameters. Training a model that is too small on massive data wastes compute, because its capacity saturates.
This insight reframes leadership decisions:
- Budget first
- Architecture second
- Model size last
A compute-optimal model is not the largest possible model—it is the best-performing model for a fixed spend.
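A rough sketch of the budget-first calculation follows. It relies on two rules of thumb that are commonly cited in the scaling-law literature and are not taken from this paper: training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 training tokens per parameter. Both constants are assumptions, and the right values depend on data and architecture.

```python
import math


def compute_optimal(flops_budget, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training FLOPs budget into model size and training tokens.

    Assumes C ~= 6 * N * D and D ~= 20 * N (rule-of-thumb constants).
    """
    # Substituting D = r * N into C = k * N * D gives N = sqrt(C / (k * r)).
    n_params = math.sqrt(flops_budget / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


for budget in (1.2e20, 1.2e22, 1.2e24):
    n_params, n_tokens = compute_optimal(budget)
    print(f"budget={budget:.1e} FLOPs -> params={n_params:.1e}, tokens={n_tokens:.1e}")
```

Under these assumptions, a 100× larger budget buys a model only 10× larger, trained on 10× more tokens, which illustrates why model size is the last decision rather than the first.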
Why bigger models can perform worse
Scaling is not monotonic. Larger models can underperform smaller ones due to:
- Insufficient or low-quality data
- Over-regularization during alignment
- Tasks with strong priors or memorization requirements
This phenomenon—sometimes called inverse scaling—has important implications for production:
- Bigger models are not universally safer
- More alignment does not always mean better alignment
- Overconfident generalization can degrade task accuracy
For enterprises, this means model selection must be task-aware, not prestige-driven.
Diminishing returns are real—and expensive
As models improve, each incremental gain becomes more costly:
- Reducing error from 3% to 2% may require an order of magnitude more data or compute
- Small reductions in training loss can unlock large downstream quality gains
- But the marginal cost of those gains rises sharply
This creates a natural economic ceiling. At some point, further scaling:
- Increases infrastructure spend
- Increases energy consumption
- Increases organizational risk without proportionate business value
The emerging bottlenecks: Data and energy
Two hard constraints are becoming unavoidable:
- Data exhaustion
  High-quality, human-generated data is finite. As public data becomes restricted, proprietary data becomes the primary competitive moat.
- Energy consumption
  Compute scales faster than energy production. Electricity, not GPUs, is increasingly the binding constraint.
Token optimization is therefore not just a cost concern—it is a sustainability strategy.
Multi-model routing and speculative decoding: Turning intelligence into a cost-controlled system
As models scale, the dominant cost driver is no longer training—it is inference at volume. In production, the economic question is not how smart a model is, but how often that intelligence is required.
This is where multi-model routing and speculative decoding fundamentally change the cost curve.
The core insight: Not all tokens deserve the same model
Most enterprise workloads exhibit a predictable distribution:
- A large majority of requests are simple, repetitive, or well-structured
- A small minority require deep reasoning, synthesis, or ambiguity resolution
Running every request through a frontier-grade model is equivalent to running every database query on a distributed analytics engine. It works—but it is wasteful.
Multi-model routing treats intelligence as a tiered resource, not a monolith.
Multi-model routing as an economic primitive
In a routed system:
- Smaller, faster, cheaper models handle the majority of requests
- Larger models are invoked only when confidence, complexity, or risk thresholds are crossed
This can be implemented using:
- Confidence estimation
- Output entropy
- Rule-based heuristics
- Lightweight classifiers
- Shallow “preview” generations
The critical outcome is this:
Most tokens are processed by the cheapest viable model.
From a cost perspective, this yields multiplicative savings:
- Lower average cost per request
- Lower tail latency
- Higher throughput on fixed infrastructure
From a reliability perspective, it also reduces hallucination risk by escalating only genuinely uncertain cases.
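One of the signals listed above, output entropy, can be sketched in a few lines. This is a minimal illustration and not a production router: the probability vectors stand in for a small model's distribution over candidate answers, and the one-bit threshold is a hypothetical value that would need tuning per task.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def route(small_model_probs, max_entropy_bits=1.0):
    """Keep the request on the small model if its output distribution is sharp."""
    if entropy_bits(small_model_probs) <= max_entropy_bits:
        return "small"
    return "large"


confident = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one answer
uncertain = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly: 2 bits of entropy

print(route(confident), route(uncertain))
```

A low-entropy distribution indicates the small model has effectively settled on an answer, while a flat distribution is a reason to escalate.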
Why smaller models often perform better than expected
Smaller models have two structural advantages in routed systems:
- They are less verbose
- They generate each token faster during decoding
For routine tasks—classification, extraction, templated summaries—their probability mass is often sharper than that of larger models, which tend to over-generalize.
This creates a counterintuitive outcome:
Smaller models can be both cheaper and more accurate for large portions of production traffic.
Speculative decoding: Trading probability for latency
Speculative decoding addresses a different bottleneck: sequential decoding latency.
The idea is simple:
- A small model proposes a sequence of candidate tokens
- A larger model verifies them in parallel
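- The longest run of drafted tokens the large model agrees with is accepted in bulk; the first rejected token and everything after it are discarded and replaced by the large model's own choice
Instead of emitting tokens one at a time, the large model can approve multiple tokens per step.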
The result:
- Lower time to a complete answer
- Lower wall-clock latency
- Fewer decoding steps on the expensive model
Importantly, speculative decoding does not compromise output quality—the large model remains the final authority.
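The propose-and-verify loop can be sketched with toy models. The code below shows the greedy variant only, and `draft_next` and `target_next` are hypothetical deterministic functions standing in for real models. The verification loop simulates what would be a single parallel forward pass of the large model, so each iteration is counted as one pass.

```python
def target_next(context):
    """Toy 'large model': deterministic next token from the context."""
    return (context[-1] * 2 + len(context)) % 10


def draft_next(context):
    """Toy 'small model': agrees with the target except at some positions."""
    if len(context) % 5 == 0:
        return (context[-1] + 1) % 10
    return (context[-1] * 2 + len(context)) % 10


def greedy_decode(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, k):
    tokens, target_passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens sequentially (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks every drafted position (one parallel pass).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # target's correction replaces the mismatch
                break
            accepted.append(draft[i])
        else:
            # Every draft token matched: the same pass also yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], target_passes


prompt = [1, 2, 3]
spec_tokens, passes = speculative_decode(prompt, n_new=20, k=4)
print(f"identical output: {spec_tokens == greedy_decode(prompt, 20)}, target passes: {passes} vs 20")
```

Every emitted token is either confirmed or produced by the target function, so the output matches plain greedy decoding while the number of target passes falls. Sampling-based variants use a probabilistic acceptance rule instead of exact matching.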
Token economics of speculative decoding
From a token standpoint:
- Small-model tokens are cheap
- Large-model tokens are expensive
- Verifying several drafted tokens in one parallel pass is cheaper than generating them one at a time
Speculative decoding shifts the workload from expensive generation to cheaper verification, collapsing latency without inflating hallucination risk.
This is particularly effective when:
- Outputs are moderately long
- Language structure is predictable
- The small model is well-aligned with the domain
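How much speculation helps depends on how often drafted tokens are accepted. The sketch below uses the standard expectation for speculative decoding under the simplifying assumption that each drafted token is accepted independently with the same probability. The acceptance rates, draft length, and relative draft cost are hypothetical inputs.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens produced per large-model pass: (1 - a^(k+1)) / (1 - a)."""
    a, k = acceptance_rate, draft_len
    if a >= 1.0:
        return float(k + 1)  # every draft token accepted, plus one extra token
    return (1 - a ** (k + 1)) / (1 - a)


def estimated_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Speedup over plain decoding when one draft step costs draft_cost_ratio of a target step."""
    # Each iteration pays for draft_len draft steps plus one target pass.
    cost_per_iteration = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_pass(acceptance_rate, draft_len) / cost_per_iteration


for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    speedup = estimated_speedup(rate, draft_len=4, draft_cost_ratio=0.05)
    print(f"acceptance={rate}: tokens/pass={tokens:.2f} speedup={speedup:.2f}x")
```

The estimate makes the conditions above concrete: predictable text and a well-aligned draft model raise the acceptance rate, which is the main driver of the gain.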
Routing + speculation: A compound effect
Individually, routing and speculation offer savings. Together, they reshape system economics.
A typical high-efficiency pipeline looks like:
- Route request to a small model
- Accept result if confidence is high
- Otherwise, invoke a larger model
- Use speculative decoding to accelerate generation
- Escalate only if verification fails
This creates a graceful degradation curve:
- Cost increases only when complexity demands it
- Latency scales with difficulty, not worst-case assumptions
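The control flow of such a pipeline can be sketched with stub models. Everything here is hypothetical: the stubs, the confidence values, the 0.8 threshold, and the relative cost units are placeholders chosen to show how cost tracks request difficulty.

```python
SMALL_COST, LARGE_COST = 1.0, 20.0  # hypothetical relative cost units per call


def small_model(request):
    """Hypothetical stub: returns (answer, confidence in [0, 1])."""
    confidence = 0.95 if request["difficulty"] < 0.7 else 0.40
    return f"small-answer:{request['id']}", confidence


def large_model(request):
    """Hypothetical stub; a real deployment might run this with speculative decoding."""
    return f"large-answer:{request['id']}"


def handle(request, min_confidence=0.8):
    """Try the small model first and escalate only when confidence is low."""
    answer, confidence = small_model(request)
    cost = SMALL_COST
    if confidence < min_confidence:
        answer = large_model(request)
        cost += LARGE_COST  # escalated requests pay for both calls
    return answer, cost


# Simulated workload: ten requests of increasing difficulty.
requests = [{"id": i, "difficulty": i / 10} for i in range(10)]
results = [handle(r) for r in requests]
total_cost = sum(cost for _, cost in results)
always_large_cost = LARGE_COST * len(requests)
print(f"routed={total_cost:.0f} units, always-large={always_large_cost:.0f} units")
```

Escalated requests cost more than they would have without routing, so the savings depend on the escalation rate staying low.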
Organizational implications
For leadership, this changes how AI capability should be evaluated:
- Model quality is no longer a single metric
- System-level performance matters more than peak intelligence
- Cost efficiency becomes a design constraint, not a finance afterthought
Teams that master routing and speculation are effectively building intelligence supply chains, not just deploying models.
Executive takeaway
The future of LLM efficiency is not a better single model.
It is better orchestration of multiple models.
Multi-model routing ensures:
- Cheap tokens are used first
- Expensive tokens are used only when justified
Speculative decoding ensures:
- Sequential bottlenecks are amortized
- Latency per output token falls, because several tokens can be accepted per large-model step
Together, they turn LLMs from static engines into adaptive, cost-aware systems.
Written by:
Andy Logani
Executive Vice President and Chief AI Officer, EXL
Somya Rai
Vice President and AI Architect, EXL