The One-Token Model
A unified framework for measuring the financial & environmental impact of AI inference.
Published on December 1st, 2025 by Antarctica Global Technology & Consulting Pvt. Ltd.

Enterprise adoption of generative AI has expanded rapidly, with recent surveys indicating that approximately 90% of organizations have integrated AI into at least one workflow. Yet, despite this widespread uptake, most enterprises remain confined to exploratory or pilot-stage implementations. This limitation is not due to inadequate model capability or technical skill, but to the absence of a standardized, rigorous framework for quantifying computational work. Today, organizations have no dependable, foundational measurement framework for understanding the impact of their AI investments.
In the absence of such a foundational metric, organizations are unable to evaluate efficiency, characterize model behavior, or establish reliable relationships between usage patterns, cost structures, and associated energy or emissions impact. Without this measurement foundation, enterprises cannot build predictable budgets, scale workloads responsibly, or enforce AI governance with confidence.
This gap shows up in three distinct ways:
1. Measurement of Usage
While leading AI model providers price by API call, token, or GPU hour, there is no widely accepted, industry-wide standard that allows organizations to understand and compare the compute effort or resource use behind different jobs or workflows.
2. Rising Costs of AI Usage
As models become larger and more complex, and as backend architectures (servers, batch sizes, mixture-of-experts models) become more advanced, billing structures grow less transparent for the enterprise buyer. Organizations rarely receive detailed breakdowns of how their usage, prompt complexity, or model choice contributes to total compute cost. This makes budgeting unpredictable and optimization difficult.
3. The Environmental Impact of this Usage
AI’s energy use and carbon footprint are rarely transparent. Google’s disclosure that a median Gemini prompt uses 0.10 Wh and emits 0.02 gCO₂e is directionally useful. But a median value conceals the variability across prompts of different lengths, structures, and complexities, leaving organizations without insight into the full distribution of environmental impact.
Every meaningful action performed by an AI model today, whether understanding text, analyzing an image, interpreting audio, or generating a response, ultimately manifests as computation over tokens.
Tokens are the atomic units through which large language and multimodal models perceive, process, and produce information. They form the only universal unit that spans evaluation, inference, cost, hardware usage, and environmental impact.

During evaluation, models are tested with structured prompts to measure accuracy, coherence, and task performance. These tests also reveal how many tokens a model must process to achieve a given level of quality. When translated into energy, or cost per token, evaluation benchmarks become multidimensional, allowing organizations to compare not just accuracy but energy and cost efficiency across model versions or configurations.
However, the real impact emerges in inference. Google reported processing 1.3 quadrillion tokens monthly in 2025, a scale so large that raw token counts become abstract. The way to resolve this ambiguity is by translating token volume into quantifiable cost, usage, and energy consumption, turning statistically overwhelming numbers into operationally relevant metrics.
| Provider | Reported Token Volume (Monthly, 2025) | Notes/Source |
|---|---|---|
| Google | 1.3 quadrillion (1.3 × 10¹⁵) | Across all surfaces; doubled from 480 trillion in May to 980 trillion in July, reaching 1.3 quadrillion by summer. |
| OpenAI | >259 trillion (API only) | API at >6 billion tokens/min; total including ChatGPT estimated higher but not publicly detailed; 800 million weekly active users. |
| Microsoft | 1.7 trillion (Foundry product) | Specific to Foundry; broader Copilot usage likely higher but no aggregate reported; quotas up to 32 billion for GPT-5 models. |
| — | ~25 trillion (estimated) | 25 billion API calls in Q2; assuming ~1,000 tokens per call; 30 million monthly active users. |
Tokens are the model's internal representation of meaning. Just as humans rely on words, models rely on tokens: discrete, structured units that encode inputs, resolve context, and output information. Because all computation happens on tokens, they become the only unit that spans four critical dimensions: evaluation quality, inference cost, hardware usage, and environmental impact.
By late 2025, a growing body of practice and research places tokens at the centre of how AI is measured, priced, and optimised. Providers increasingly expose token-based limits, routing rules, and pricing tiers. New hardware generations such as Blackwell, MI300, and Gaudi make token-level behaviour far easier to observe through metrics like tokens per second, per watt, and per joule.
This direction is echoed in the Stanford AI Index 2025, which emphasises token-normalised benchmarks for comparing inference cost, efficiency, and carbon intensity. The Index highlights a substantial reduction in inference costs since 2022, now commonly measured in token units, and encourages hybrid evaluation that pairs token usage with actual outputs. Complementary research on token efficiency, such as Token Length Control with Dynamic Rewards (TLDR)-style dynamic reward shaping, demonstrates that substantial reductions in token usage are possible without affecting accuracy, particularly for reasoning and maths-heavy tasks.
At the application layer, similar patterns appear in how enterprises design and operate AI products. Teams increasingly treat token usage as a KPI, budgeting and allocating costs in tokens, prompt engineering in RAG pipelines, agent orchestration, caching, and session management. Sparkco AI’s 2025 analysis illustrates this shift, documenting 30–40% token reductions in real deployments through retrieval optimization, pruning, batching, and improved memory management. Emerging frameworks also assess the efficiency of tokenization itself.
These developments reflect a broader move toward understanding how much compute, cost, and energy each token represents.
Model providers, across both closed and open-source releases, price their APIs (OpenAI, Gemini, Anthropic, DeepSeek, and others) exclusively in tokens, distinguishing between input tokens (what the user sends to the model) and output tokens (what the model generates), and often adding separate rates for cached/context tokens. This differentiation of input, output, and context-extension tokens exists because each carries a different computational footprint, from attention cost to KV-cache pressure, driving more granular cost models, including surcharges for long contexts or discounted rates for efficient batching and (in the future) low-carbon regions.
Real-time token telemetry is now standard. API users receive live token counts, burn-rate signals, and “token waste” diagnostics, enabling prompt optimization, throttling, and model switching. Token behaviour also shapes modern inference scheduling: serving systems built around multi-model routing, mixtures of experts, and chip-aware orchestration use token-arrival rates and tokens-per-joule measurements to size batches, route workloads, and maintain SLAs across large fleets. Providers, under pressure from the community, have additionally begun to disclose energy-per-token and carbon-per-token metrics (see Google’s per-token emissions analysis).
In the open-source ecosystem, token-normalised benchmarks (tokens per task, latency per token, and energy per token), are now common across multilingual and multimodal evaluations. (see Hugging Face’s AI Energy Score)
Together, these practices make tokens the provider’s operational reference point for pricing, routing, efficiency, and sustainability.
| Model | Input Price (per 1M tokens) | Output Price (per 1M tokens) |
|---|---|---|
| — | $1.25 | $10.00 |
| — | $3.00 | $15.00 |
| — | $1.20 (prompts ≤ 200k tokens); $4.00 (prompts > 200k tokens) | $10.00 (prompts ≤ 200k tokens); $15.00 (prompts > 200k tokens) |
| — | $0.19–$0.49 (3:1 blended) | $0.19–$0.49 (3:1 blended) |
| — | $0.20 | $0.50 |
| — | $0.028–$0.28 | $0.42 |
| — | $0.40 | $2.00 |
At an infrastructure level, every token processed by a model triggers real, measurable work on the accelerator, moving data through memory, running transformer blocks, hitting or missing caches, and drawing power. New GPU and accelerator stacks now expose per-token telemetry, reporting how much bandwidth, cache activity, heat, and power each segment of computation uses. Cloud and on-prem orchestration systems collect this data into live dashboards and sustainability reports, giving operators a detailed view of the physical cost of each token.
This level of visibility has reshaped tooling. OpenTelemetry extensions now treat tokens as first-class units, and FinOps teams combine cost, power, and workload metrics to calculate tokens-per-joule, cost-per-prompt, and carbon-per-prompt. These metrics feed internal dashboards, SLAs, and even customer billing. Green routing frameworks (for example, GreenPT’s green router) help choose the best model for each request, shifting workloads to cleaner regions or delaying inference when the grid is under stress.
Multimodal models add nuance: text tokens, image patches, and audio segments run through different paths, so comparisons often use normalised semantic units or composite efficiency scores to reflect equivalent work. At the same time, operators increasingly attribute part of the hardware’s lifecycle (Scope 3) emissions to inference, giving a fuller picture of carbon intensity. These capabilities are no longer limited to major hyperscalers. Edge devices and local GPUs now ship with SDKs that report tokens-per-joule or carbon-per-prompt directly to end users.
Tracking per-token metrics at the hardware layer gives operators a direct line from logical tokens to bandwidth, power draw, cooling load, and carbon intensity, as the sketch below illustrates.
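To make this concrete, the following is a minimal sketch of how tokens-per-joule, cost-per-prompt, and carbon-per-prompt can be derived from basic telemetry. The function name and sample values are illustrative assumptions, not part of any provider's tooling.

```python
# Hedged sketch: deriving FinOps-style per-prompt metrics from raw telemetry.
# All names and sample values are illustrative assumptions.

def per_prompt_metrics(tokens_generated: int,
                       avg_gpu_power_w: float,
                       generation_seconds: float,
                       grid_intensity_g_per_kwh: float,
                       price_per_kwh_usd: float) -> dict:
    energy_j = avg_gpu_power_w * generation_seconds   # joules drawn during generation
    energy_kwh = energy_j / 3_600_000                 # 1 kWh = 3.6e6 J
    return {
        "tokens_per_joule": tokens_generated / energy_j,
        "carbon_per_prompt_g": energy_kwh * grid_intensity_g_per_kwh,
        "cost_per_prompt_usd": energy_kwh * price_per_kwh_usd,
    }

# Example: 400 output tokens generated in 2 s at an average draw of 700 W.
print(per_prompt_metrics(400, 700.0, 2.0, 475.0, 0.12))
```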
In the context of tokens, “users” primarily refers to the teams, products, and organizations that consume model capacity through APIs or embedded AI workflows, not just individual end-consumers typing into a chat interface.
Enterprise and developer users receive detailed token telemetry from providers, allowing them to allocate compute costs across departments or features, optimise prompts and RAG pipelines, and monitor energy and performance per API call. For SaaS builders and product managers, token consumption directly shapes the economics of their products, even if their customers only see high-level abstractions like “queries processed” or “documents analysed.”
Professional and power users, such as those relying on GitHub Copilot or AI productivity tools, sometimes interact with token limits indirectly through quotas or usage tiers. For them, “fewer tokens” can translate into staying within plan limits or achieving faster interactions, but not always into direct cost savings. In contrast, casual consumer users (like someone using ChatGPT or Copilot on a personal plan) rarely see tokens at all; their experience is governed by fixed envelopes, input caps, or rate limits rather than per-token economics or sustainability benefits. In enterprise environments, internal employees or client users typically never manage tokens directly, but enterprise IT and AI managers do, feeding token metrics into billing, reporting, optimisation, and environmental dashboards.
Across these contexts, tokens matter as an operational, economic, and sustainability metric only for the users who are responsible for or billed by their usage. For others, the effects are indirect: efficiency at the organisational level improves speed, reliability, and sustainability downstream.
To summarize, tracking per-token metrics enables users to allocate compute costs across departments or features, optimise prompts and RAG pipelines, and monitor energy and performance per interaction.
Across modalities, whether the AI model is reading text, analyzing an image, or listening to audio, it always breaks the input into small, understandable pieces called tokens.

Text Tokenization
Text is decomposed into small units: characters, sub-words, or words, through tokenizers such as OpenAI’s tiktoken, which approximates one token as ¾ of an English word. The model processes these tokens sequentially and produces new tokens one by one.
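As a brief illustration, the sketch below counts tokens with OpenAI's open-source tiktoken library. The choice of the cl100k_base encoding is an assumption; each model family ships its own vocabulary.

```python
# Minimal sketch of text tokenization with OpenAI's open-source tiktoken library.
# The encoding ("cl100k_base") is an assumption; real deployments should use the
# encoding that matches the target model.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "Tokens are the atomic units of AI inference."
token_ids = encoding.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print(token_ids[:8])                                # integer IDs the model actually sees
print([encoding.decode([t]) for t in token_ids])    # the sub-word pieces they represent
```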

Audio Tokenization
Digitally, an audio signal is described as a continuous waveform of sound pressure over time. Audio tokenization transforms continuous sound waves into discrete representations, or tokens, that sequence models can interpret.
Some techniques include:
1. Phoneme/Character Tokens (Automatic Speech Recognition), which convert spoken language into text by transforming audio signals into discrete tokens such as phonemes, characters, or words.

2. Codec Tokens (Neural Audio Codecs), such as Meta’s EnCodec and Google’s SoundStream, which turn audio into sequences of tokens without compromising quality, using vector quantization.
These representations align audio with the same discrete, token-based processing as text.

Image Generation
In a digital context, an image is described as a collection of pixels. However, image generation models don’t see an image as pixels, but as structured information broken into smaller, meaningful representations. These compact representations are called tokens. They help the model understand patterns, textures, and semantics. Popular approaches include:
Patch Embeddings: They split an image into uniform, non-overlapping patches, each represented as a token. These tokens act like words in a sentence, allowing the model to process visual information as a structured sequence of discrete units.
For a 224×224 image and a 16×16 patch size, this gives (224/16)² = 14² = 196 patches. Each patch becomes one token. These 196 tokens collectively describe the entire image, much like how words (tokens) describe meaning in a sentence.
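The patch arithmetic above can be reproduced in a few lines. The sketch below splits a random stand-in image into 16×16 patches with NumPy; it illustrates only the tokenization step, not any particular vision model.

```python
# Sketch of ViT-style patch tokenization: split an image into non-overlapping
# 16x16 patches, each of which becomes one token. Values follow the example
# above (224x224 image, 16x16 patches -> 196 tokens).
import numpy as np

image = np.random.rand(224, 224, 3)          # stand-in for a real RGB image
patch = 16

h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)   # (196, 768): 196 tokens, each a flattened 16x16x3 patch
```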



Video Generation
The most common way to tokenize videos today is by breaking them into individual frames and pairing them with the corresponding audio. Each frame acts like an image token, while the audio provides temporal context.
For example, models like Gemini process video as a sequence of image tokens interwoven with text and audio information.

To summarize, tokens are the currency of AI. Whenever a company embeds AI into an application, whether for document analysis, summarization, search, chat, content generation, or media processing, every user interaction becomes a sequence of tokens in and tokens out.
The One-Token Model (OTM) quantifies the energy consumed by an LLM during inference and expresses its environmental impact on a per-token basis. It tells you the energy consumption of your token usage, and its consequent emissions.
The OTM works by observing the computation required to process each token, whether text, audio, image, or multimodal, and translating that computation into energy (kWh) and carbon emissions (gCO₂e). Because every interaction with an LLM ultimately reduces to tokens in and tokens out, the token becomes the most precise and universally comparable unit for evaluating the sustainability of AI usage. Frontier models such as those from OpenAI, Anthropic, Google, and Meta already operate internally using tokenization frameworks.

Although the word “token” appears at multiple layers of the AI stack, the One-Token Model unifies them into a coherent structure for measuring energy.
| Category | What They Are | Where Used | Example | OTM |
|---|---|---|---|---|
| Model/Text tokens (Core Level) | Smallest units processed or produced by LLMs | Inside GPT, Claude, Gemini, Llama | "Hello world!" → ["Hello", " world", "!"] | Primary unit for computing energy and CO₂ per token |
| API Billing Tokens | Units used by providers to meter usage | API dashboards, invoices | 2M tokens billed | Connect cost, usage, and energy |
| End-User Tokens | Hidden tokens exchanged during user interactions | Chat interfaces, enterprise apps | A 200-word prompt → 300 input + 400 output tokens | Enables per-user emissions reporting |
| Hardware-Level Tokens | Token-equivalent compute units derived from GPU telemetry | GPUs, TPUs, accelerators | 1,000 tokens → measurable watt-seconds | Links logical tokens to real-time power and carbon intensity |
| Training Tokens | Tokens used during model training | AI labs, lifecycle analysis | 1 trillion training tokens | Represents embodied emissions outside inference |
Example 1: Model/Text Tokens
When a user chats with ChatGPT, their message is broken into tokens that form the prompt. The model then generates a response of N tokens, which remain invisible to the user. The energy used for this interaction can then be measured or estimated.

Example 2: API/Billing Tokens
Developers using APIs from popular LLMs as well as AI powered coding IDEs can view token usage metrics, including the number of tokens exchanged (input, output, and total) per API call or interaction, through their respective dashboards or API responses.
A developer using OpenAI or Anthropic’s API receives telemetry showing:
500 input tokens + 400 output tokens = 900 tokens total.
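Where such telemetry is exposed, reading it is straightforward. The sketch below shows the pattern using the OpenAI Python SDK's usage fields; the model name is only an example, and other providers expose similar counters in their responses.

```python
# Sketch of reading token telemetry from a provider API response via the
# OpenAI Python SDK. The model name is an example; the same pattern applies
# to other providers that report usage per call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the One-Token Model in two sentences."}],
)

usage = response.usage
print(f"input={usage.prompt_tokens}, output={usage.completion_tokens}, total={usage.total_tokens}")
```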

The One-Token Model estimates the environmental impact of inference by quantifying the energy required to generate each token.
Example 3: End-User Tokens
A user writes a 200-word prompt in an enterprise chatbot.
Behind the scenes, the system logs:
300 input tokens processed + 400 output tokens generated.
Although the user never sees these tokens, the organisation can attribute emissions back to this specific user or workflow through OTM-based reporting.
Example 4: Hardware-Level Tokens
A data center runs an inference server where 1,000 logical tokens for a single request translate to a measurable watt-second profile on the GPU (power spikes, memory movement, cache hits). OTM converts these GPU telemetry traces into carbon intensity per token, enabling real-time hardware-aware emissions estimates.
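As an illustration of this hardware-level view, the sketch below integrates a hypothetical sampled GPU power trace into joules and then into energy and emissions per token. The trace values, token count, and grid intensity are placeholder assumptions standing in for real telemetry.

```python
# Sketch: converting a sampled GPU power trace (watts over time) into energy
# and carbon per token for one request. All values are hypothetical placeholders.
import numpy as np

timestamps_s = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # sample times (s)
power_w      = np.array([320, 610, 680, 640, 350])    # GPU draw at each sample (W)
tokens_generated = 1_000
grid_intensity_g_per_kwh = 475.0

# Trapezoidal integration of W over s gives joules.
energy_j = float(np.sum(0.5 * (power_w[1:] + power_w[:-1]) * np.diff(timestamps_s)))
energy_kwh = energy_j / 3.6e6

print(f"{energy_j:.0f} J total, "
      f"{energy_kwh / tokens_generated * 1e6:.3f} mWh/token, "
      f"{energy_kwh * grid_intensity_g_per_kwh / tokens_generated * 1000:.4f} mgCO2e/token")
```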
Example 5: Training Tokens
A model developer reports that a new LLM was trained on 1 trillion tokens. OTM treats these as part of the model’s embodied emissions, distributing the training carbon footprint across all future inference tokens to reflect lifecycle impact, something end users and enterprises can now see in per-token calculations.
Inference impact is measured by observing how a model exercises compute resources during token generation. The OTM starts at the hardware layer, where energy is actually consumed, and progressively incorporates model-level and system-level factors to calculate the energy per token and the corresponding carbon emissions.
The model builds on a shared body of empirical evidence rather than any single source. It aligns with the direction of prior work such as EcoLogits, methodologies developed by Salesforce and Hugging Face, and independent research, while extending these ideas into a unified, systematic framework designed for practical implementation.
Where existing studies provide measurement techniques, benchmarks, or partial approaches, the OTM consolidates these insights into an end-to-end method that connects hardware activity to model-level inference behavior and ultimately to energy and emissions attribution.
The One-Token Model separates the inference footprint into three interacting domains: the hardware layer, the model layer, and the system layer.
This multi-layered structure ensures that energy estimates reflect both physical resource usage and model-specific behavior.

GPUs dominate the power consumption of LLM inference.
While GPUs typically dominate power usage during LLM inference, the rest of the server (CPU, memory, networking, cooling fans, and other non-GPU components) also consumes power. To represent this non-GPU overhead, the model includes a baseline server power component. A value of 1.2 kW is used as the default parameter, as it aligns with measurements reported across several sources, including BoaviztaAPI specifications for inference-optimized systems such as AWS p5-class and comparable hyperscaler GPU nodes.
This value should not be interpreted as a universal constant; rather, it serves as a practical, evidence-supported approximation for high-performance inference servers. Where deployment-specific telemetry is available, particularly in on-premise or tightly controlled environments, this baseline can be adjusted to reflect measured operating conditions.
During inference, this baseline is treated as fully attributable to the workload. Thus, the energy consumed by non-GPU components is estimated as:
Server Energy (kWh) = Inference Time (hours) × 1.2 kW
This baseline accounts for idle power overhead and networking activity associated with serving the request.
The LLM architecture is one of the major factors that determine how many GPUs are required and how much memory the model occupies during inference.
The LLM model parameters focus on the internal structure and configuration of the language model itself.
We also include a Quantization Factor (Q) to account for the reduced numerical precision of the model weights. Quantization (e.g., FP16 or INT8 instead of FP32) reduces the precision of the weights, which significantly reduces the memory needed to store model parameters and allows more of them to fit in the same GPU memory. In production LLM deployments, quantization levels (e.g., FP16, INT8) are generally fixed for a model family. The reference for the quantization factor is the AI Energy Score methodology by Hugging Face.
Active GPUs Used During Inference estimates how many GPUs are activated during inference, based primarily on the total number of model parameters, the quantization factor, and the GPU memory available. For open-source models, these values can be captured in real time. Because the quantization factor or precision is not publicly known for closed models such as ChatGPT and Claude, we add an overhead factor:
Memory_model (GB) = Memory_LLM inference (GB) × Overhead
The memory required for inference is therefore the parameter count multiplied by the bytes per parameter implied by the quantization factor, scaled by this overhead.
Once we know the model’s memory footprint under the chosen precision, we can determine how many GPUs are needed to host the model during inference. Modern large models often cannot fit into a single GPU’s memory, so they are sharded across multiple GPUs. The number of active GPUs is the model’s memory footprint divided by the memory available on a single GPU, rounded up to the next whole device, as sketched below.
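The following is a minimal sketch of this memory and GPU-count estimate. The bytes-per-parameter mapping, the 1.2× overhead, and the 80 GB GPU memory figure are illustrative assumptions rather than fixed constants of the methodology.

```python
# Sketch of the memory-footprint and active-GPU estimate described above.
# The bytes-per-parameter mapping, the 1.2x overhead, and the 80 GB GPU
# memory value are illustrative assumptions.
import math

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def inference_memory_gb(params_billions: float, quantization: str,
                        overhead: float = 1.2) -> float:
    """Model weights in GB under the chosen precision, plus serving overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[quantization]  # 1B params * 1 byte ~= 1 GB
    return weights_gb * overhead

def active_gpus(params_billions: float, quantization: str,
                gpu_memory_gb: float = 80.0) -> int:
    """Number of GPUs needed to shard the model across the given accelerators."""
    return math.ceil(inference_memory_gb(params_billions, quantization) / gpu_memory_gb)

# Example: a 70B-parameter model served in FP16 on 80 GB GPUs.
print(inference_memory_gb(70, "FP16"))   # ~168 GB including overhead
print(active_gpus(70, "FP16"))           # 3 GPUs
```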
The methodology incorporates three core latency measures:
These metrics determine the time window over which energy consumption is integrated.
The number of generated tokens directly affects the final per-token emissions. This is the denominator in the One-Token-Model calculation.
The energy consumption can be estimated using the formula:
Energy (kWh) = [Server Energy (kWh) + Active GPUs × GPU Power (kW) × GPU Utilization Rate × Generation Latency (hours)] × PUE
PUE (Power Usage Effectiveness) adjusts for cooling and infrastructure overhead.
Carbon emissions are obtained by multiplying energy consumption by grid carbon intensity:
CO₂Total (gCO₂e) = Energy (kWh) × Grid Carbon Intensity (gCO₂e/kWh)
Finally, emissions are normalized by the number of generated output tokens:
CO₂ per token (gCO₂e/token) = CO₂Total ÷ Output Tokens
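Putting the pieces together, the sketch below implements the formulas above end to end. The default values (1.2 kW server baseline, GPU power, PUE, grid intensity) are illustrative, and applying PUE to the combined server-plus-GPU energy reflects our reading of the formula.

```python
# Sketch of the OTM energy and emissions calculation, following the formulas
# above. Default values (1.2 kW server baseline, 0.7 kW GPU, PUE 1.2, grid
# intensity 475 gCO2e/kWh) are illustrative assumptions, not normative constants.

SERVER_BASELINE_KW = 1.2   # non-GPU server power (CPU, RAM, networking, fans)

def energy_kwh(active_gpus: int, gpu_power_kw: float, gpu_utilization: float,
               generation_latency_h: float, pue: float = 1.2) -> float:
    server_energy = SERVER_BASELINE_KW * generation_latency_h
    gpu_energy = active_gpus * gpu_power_kw * gpu_utilization * generation_latency_h
    return (server_energy + gpu_energy) * pue

def co2_per_token(energy: float, grid_intensity_g_per_kwh: float,
                  output_tokens: int) -> float:
    return energy * grid_intensity_g_per_kwh / output_tokens

# Example: 2 GPUs at 0.7 kW, 60% utilized, 3 s of generation, 400 output tokens.
e = energy_kwh(active_gpus=2, gpu_power_kw=0.7, gpu_utilization=0.6,
               generation_latency_h=3 / 3600)
print(f"{e * 1000:.3f} Wh, {co2_per_token(e, 475.0, 400) * 1000:.4f} mgCO2e/token")
```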
To summarize, the OTM is powered by data collected at 3 levels:
| Provider Level | Hardware Level | Usage Level |
|---|---|---|
| Total Parameters | Number of GPUs | Number of tokens |
| Active Parameters | Active GPUs | Time taken to First Token (Reasoning period) |
| Quantization | GPU Utilization | Generation latency (Inference Time) |
| PUE | GPU Power Draw | |
| Batching | Server Power | |
| Caching | Memory Overhead | |
| Network Overhead | | |
Assumption: For proprietary models we estimate parameter counts based on performance heuristics and community reverse-engineering. We assume a standard quantization level (typically FP16 or INT8) where not disclosed.
Limitation: This method may introduce margins of error when providers update backend architectures. Additionally, OTM V1.0 focuses on text-to-text generation and does not yet account for the specific overhead of multimodal inputs (images/audio).
Assumption: The model currently isolates GPU power consumption, as GPUs represent the dominant energy sink during inference. Non-GPU components (CPU, RAM, networking) are aggregated into a fixed baseline server power overhead (defaulting to 1.2 kW based on hyperscaler specifications). We assume a minimum idle GPU utilization of 10%; in practice, GPU utilization cannot be 0% unless the device is fully powered off.
Limitation: This creates a generic baseline that may not perfectly reflect custom on-premise hardware or alternative accelerators (like TPUs).
We are actively developing OTM Version 1.1 to address the limitations identified above. This next iteration expands the model's granularity and introduces a novel normalization layer.
Every LLM inference request unfolds in two distinct computational phases: prefill, in which the input tokens are processed, and decode, in which output tokens are generated one by one.
Empirical work, including EcoLogits, ML.ENERGY, AI Energy Score, and independent studies, consistently shows that the decode phase, the generation of output tokens, accounts for the bulk of inference compute and energy.
For this reason, V1.0 of the OTM attributes emissions primarily to output tokens, because they represent the portion of inference where compute, and therefore energy, is concentrated. Focusing on output tokens provides an immediate, practical benefit: OTM can be applied to any single model without requiring cross-provider normalization. Within any individual model, output tokens provide a stable, consistent anchor for measurement because the tokenizer, architecture, and generation pathway all operate under unified internal logic.
We recognize that, for practical measurements, many published methodologies sum these roles and report "energy per token" over the union of input and output tokens, which creates a more hardware- and architecture-agnostic metric.
Attribution to output tokens alone, as in our OTM v1.0 approach, is a pragmatic simplification, but it may undercount scenarios where large prompts (many input tokens) significantly affect energy use; this often matters for instruction-following or RAG scenarios. Counting only output tokens makes metrics portable across providers and models, but can miss model design differences that affect prefill.
There is a requirement for a consistent measurement unit that faithfully represents computational work. Output tokens are the starting point for energy attribution because they dominate inference compute. But as models evolve, input tokens, modality-specific processing tokens, and internal reasoning tokens will also need to be incorporated into a more complete accounting.
Modern AI usage rarely stays confined to one model or one provider. Organizations evaluate multiple options. Developers compare costs across OpenAI, Anthropic, and Google. Applications route requests dynamically based on workload characteristics.
For comprehensive cross-provider analysis, cost optimization, and fair performance benchmarking, the full spectrum of token types, together with multi-modal differences across images, video, audio, and text, must be reconciled into a common measurement framework that holds across providers.
OTM is designed to be applied independently for each model, provider, and modality. This makes OTM immediately usable in practice. If a developer wants to understand the energy footprint of a prompt executed on GPT-5.1, OTM can measure it. If the same prompt is run on Anthropic Claude Opus, the model can be applied again, producing a separate, model-specific estimate. Each result stands on its own, in the same way that evaluation metrics such as latency, accuracy, or cost are typically compared today: model by model, test by test.
The OTM sometimes uses a simplifying assumption: for text prompts, approximately four characters (roughly ¾ of a word) per token. This assumption is useful in contexts where exact token counts are not yet available, for example when you are not running your own tokenizer, or because each model requires its own tokenizer and different measurements for images, videos, and text.
The OTM in its current version allows teams to reason about orders of magnitude without needing access to provider-specific tokenization. However, this is explicitly an approximation. In any serious deployment, the OTM is intended to operate on the actual token counts reported by each model or inferred from its tokenizer, rather than relying on a fixed 4:1 rule. The approximation should therefore be viewed as a pedagogical bridge, not as a core constraint of the model.
But organizations don't operate in single-model environments. They evaluate options, compare providers, mix modalities, and make purchasing decisions that require apples-to-apples comparisons. Evaluating cross-provider behavior still requires normalized tokens across input tokens, behind-the-scenes processing tokens, and output tokens. Teams can apply the OTM repeatedly across providers and modalities, building a comparable view over time, model by model, use case by use case.
A core challenge in measuring AI inference impact is that a token is not a universal unit of computation across providers. Each AI provider, OpenAI, Anthropic, Google, Meta, Mistral, and others, uses its own tokenizer, vocabulary, and segmentation logic. As a result, identical text, image, audio, or video inputs can yield significantly different token counts across models, even when the underlying computational workload is similar.
| Provider (Tokenizer) | Approximate Characters per Token |
|---|---|
| OpenAI (tiktoken) | ~4 |
| Anthropic | ~3.5 |
| Google (SentencePiece) | ~3.8 |
| Meta (Llama tokenizer) | ~4 |
| Mistral (custom BPE) | ~3.2 |
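The ratios in this table can support a rough, pre-call estimate of how the same text maps to different providers' token counts. The sketch below is such an approximation layer; it is a planning heuristic only and does not replace each provider's actual tokenizer.

```python
# Sketch: estimating how the same text maps to different providers' token
# counts, using the approximate characters-per-token ratios from the table
# above. A planning heuristic, not a replacement for real tokenizers.

CHARS_PER_TOKEN = {
    "OpenAI (tiktoken)":      4.0,
    "Anthropic":              3.5,
    "Google (SentencePiece)": 3.8,
    "Meta (Llama tokenizer)": 4.0,
    "Mistral (custom BPE)":   3.2,
}

def estimated_tokens(text: str) -> dict:
    return {provider: round(len(text) / ratio)
            for provider, ratio in CHARS_PER_TOKEN.items()}

prompt = "Summarize the attached earnings call and list the three main risks."
for provider, count in estimated_tokens(prompt).items():
    print(f"{provider:24s} ~{count} tokens")
```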
Tokenizers differ in vocabulary size, subword construction, merging rules, and handling of whitespace, punctuation, and code. As a direct consequence:
1 OpenAI token ≠ 1 Anthropic token ≠ Google token ≠ Nth provider token
Four problems arise:
To summarize, real enterprise workloads increasingly chain providers ("Use Claude for summarization, GPT-4 for structured extraction"), mix modalities ("Analyze this video and generate a written report"), and route through different execution strategies (small-to-large model cascades, MoE expert selection, reasoning mode toggles). Current cost, usage, and energy frameworks cannot reconcile these heterogeneous token types into a common unit. Comparisons become unreliable for tasks like "Analyze this earnings call video and produce a summary" because you cannot meaningfully compare 15,000 video tokens + 500 text tokens in one system against 12,000 unified tokens in another.
The OTM is designed around a simple principle: inference compute is overwhelmingly driven by output-generation steps, and output tokens provide a stable anchor for attributing energy and emissions. Within any individual model, this anchor is consistent because the tokenizer, architecture, and decoding pathway operate under a unified internal logic. Across providers, however, current cost, usage, and energy frameworks do not reconcile heterogeneous token types into a common unit, and this lack of a standardized token unit is a foundational gap in cross-provider sustainability, benchmarking, and cost analysis.
The Antarctica Token (AT) serves as this additional normalization layer.
It is defined as a normalized unit of LLM computational work, independent of provider tokenizer differences.
An Antarctica Token represents the standardized computational effort required to process or generate one unit of semantic content, normalized across tokenization methods, model architectures, languages, and modalities.
The Antarctica Token provides a:
Effective normalization requires conversion mappings for every major model provider and architecture pattern. The Antarctica Token framework maintains the most extensive database of provider-to-normalized token conversions currently available. This database includes:
The AT Framework works on a database built through systematic empirical testing rather than only theoretical assumptions. For each provider and model, the framework measures:

AI providers release new models and tokenizer updates continuously. Maintaining accurate normalization requires ongoing measurement and database updates. The Antarctica framework incorporates:
This operational infrastructure, the continuous process of measuring, validating, and updating conversion mappings, ensures the Antarctica Token remains a stable currency in a volatile ecosystem.
A normalized standard is only as valuable as its accessibility to the systems that need it. The theoretical rigor of the One-Token Model must be translated into actionable telemetry within real-world IT environments. These pathways allow organizations to ingest normalized metrics directly into their existing stacks, and apply the OTM seamlessly, regardless of whether they control the hardware or rely on third-party vendors. This brings us to the three architectural models for deployment.
The One-Token Model (OTM) can be deployed in different architectural environments depending on the level of access available to system telemetry and model internals. While the model achieves its highest accuracy when implemented directly at the provider or infrastructure layer, it remains effective in third-party API scenarios through a combination of statistical modelling, public hardware data, and latency observation. This section outlines three integration architectures and the data pathways through which OTM measurements are produced.
The API integration can be deployed in three types of architecture:
ARCHITECTURE 1:
Provider-Side Integration (Full Telemetry Access)
In this configuration, the model is executed on infrastructure that the organisation directly controls, whether on a cloud GPU instance, on-premise servers, or self-hosted open-source LLM deployments (e.g., Llama, Mistral, Falcon, or any model served via frameworks such as vLLM or Hugging Face Inference Endpoints). Because the organisation manages both the model and the underlying compute, OTM can access complete hardware telemetry and derive high-resolution measurements.
Such cases are common in companies that begin adding AI/ML modules by leveraging open-source LLMs and fine-tuning them to their business requirements.

Architecture 2:
Third-Party API Integration (Limited Telemetry)
This is the setup most commonly followed, both professionally and personally. In scenarios where models are accessed through closed-model APIs (e.g., OpenAI, Anthropic, Google), detailed hardware metrics are not exposed. In these cases, precise measures of hardware utilization, active parameters, latency, time to first token, and other metrics are completely hidden from the end user; AI becomes a black box for measurement.
In this case, OTM operates by combining publicly available model specifications, academic and industry research, observed inference timings, and statistical approximations derived from reference hardware profiles.

Architecture 3:
Hybrid Integration (Self-Hosted with Monitoring Extensions)
This architecture resembles Architecture 1: the LLM is hosted on on-premise servers. For example, a client running Llama-3 on their own servers with the Antarctica integration.

Here is a four-step approach to implementation:
Antarctica runs alongside your model as a sidecar, or "co-pilot", container in your Kubernetes deployment, sharing the same compute context as the main inference container (e.g., vLLM or TGI) without interfering with the model’s critical path or latency.
Once deployed, the sidecar begins collecting live GPU telemetry by interfacing directly with the host’s hardware drivers. Unlike external API estimates, this allows for the collection of ground-truth data. It continuously streams GPU power draw, utilization, and memory activity, as in the sketch below.
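The following sketch shows what such a telemetry loop might look like using NVIDIA's pynvml bindings; the one-second polling interval and single-GPU assumption are illustrative choices, not part of the Antarctica sidecar itself.

```python
# Sketch of a GPU telemetry loop of the kind a monitoring sidecar could run,
# using NVIDIA's pynvml bindings. The sampling window, interval, and
# single-GPU assumption are illustrative.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # first GPU on the host

try:
    for _ in range(5):                             # a short sampling window
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"power={power_w:.0f} W  gpu_util={util.gpu}%  "
              f"mem_used={mem.used / 2**30:.1f} GiB")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```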
The Antarctica system then connects directly to your model serving process. Wrapping the model generation function populates metrics such as input and output token counts, time to first token, and generation latency.
Finally, the system answers with a structured output. Because the hardware is monitored directly, the output contains actuals rather than estimates.
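As a sketch of what wrapping the generation function and emitting a structured record could look like, the code below defines a simple instrumentation wrapper. The names generate_fn, count_tokens, the energy callback, and the output fields are hypothetical stand-ins; a real integration would plug in the sidecar's measured values.

```python
# Sketch of wrapping a model's generation function to emit a structured
# per-request record. `generate_fn`, `count_tokens`, the energy callback, and
# the field names are hypothetical; real integrations would use measured values.
import time
from typing import Callable

def instrument(generate_fn: Callable[[str], str],
               count_tokens: Callable[[str], int],
               energy_kwh_since: Callable[[float], float],
               grid_intensity_g_per_kwh: float = 475.0):
    def wrapped(prompt: str) -> dict:
        start = time.time()
        completion = generate_fn(prompt)              # the model's critical path
        latency_s = time.time() - start
        energy = energy_kwh_since(start)              # e.g., reported by the sidecar
        out_tokens = count_tokens(completion)
        return {
            "input_tokens": count_tokens(prompt),
            "output_tokens": out_tokens,
            "generation_latency_s": round(latency_s, 3),
            "energy_kwh": energy,
            "co2e_g": energy * grid_intensity_g_per_kwh,
            "co2e_g_per_token": (energy * grid_intensity_g_per_kwh / out_tokens
                                 if out_tokens else 0.0),
            "completion": completion,
        }
    return wrapped
```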
The One-Token Model can be implemented at multiple layers of an AI system depending on the visibility available into usage, hardware, and model behavior. Broadly, OTM can be applied in three domains: usage-level analytics, hardware-level monitoring, and provider-level benchmarking.
Together, these domains allow OTM to support individual users, enterprises, and system operators in evaluating the computational, economic, and environmental implications of AI workloads.

Case 1: Usage
At the usage level, OTM quantifies how much computational work is performed in response to user interactions and translates this into energy and carbon metrics.
Individual Interactions - AIWattch
For individual users interacting with a chat-based LLM, OTM can measure the per-prompt impact of an inference event using output tokens as the primary unit. Tools built on top of the methodology (such as lightweight instrumentation layers or prompt-side extensions) can help users understand:
This enables end-users to make informed decisions about how they use AI systems.

Enterprise Usage
In organisational contexts, thousands of interactions accumulate into substantial computational footprints. When deployed within enterprise systems, OTM can aggregate usage across teams and roles to provide:

This supports internal observability into employee AI usage, compliance requirements, and responsible-use governance.
Case 2: Hardware
OTM connects usage to physical activity when hardware telemetry is available. Using GPU and server-level metrics, the model translates observed compute activity into:

This produces a direct mapping between computational work and environmental footprint, and gives a clear tokens/CO₂ measurement in real time across GPUs.
Case 3: Provider
Because API pricing is defined per token and each provider tokenizes differently, OTM enables consistent cross-model comparisons. By linking token generation to cost and computational effort, OTM allows providers and consumers to evaluate:
OTM helps you compare AI models on a consistent, real-time basis. So any claims of being a more sustainable AI provider can be easily validated in real-time with the OTM.
The application of the One-Token Model across usage, hardware, and provider layers enables organisations not only to measure impact but to translate those measurements into operational, economic, and environmental improvements. By exposing how efficiently tokens are produced, and used, whether in API-driven applications or self-hosted deployments, OTM provides the observability needed to guide optimization strategies. These strategies typically centre on three outcomes: cost reduction, performance efficiency, and emissions minimization.

Case Study: Improving Inference Efficiency in an Enterprise Deployment
A mid-sized organisation integrates an LLM-based assistant into its internal analytics platform. After an initial period of adoption, the engineering team observes that inference-related cloud costs and GPU activity are increasing at a rate disproportionate to the growth in user queries. To understand the source of the discrepancy, the organisation deploys the One-Token Model with the AT API, to monitor how much computational work is being performed per token and how efficiently that work is converted into user-visible responses.
Establishing a Baseline
Using OTM instrumentation and the AT API, the team captures real-time metrics such as energy per output token, GPU utilization and performance per watt, and throughput across workloads.
These measurements reveal the energy cost per token and highlight variations in efficiency across workloads and times of day. This baseline becomes the reference point for targeted interventions.
Targeted Optimization
With a clear view of the computation associated with each token, the organisation implements improvements along three dimensions:
Analysis shows that the system consumes approximately 0.002 kWh per output token. By adjusting model configurations, introducing modest batching during peak periods, and refining prompt structures to reduce unnecessary generation, the team reduces this to 0.0015 kWh per token. The improvement translates into a 25% reduction in monthly GPU-related energy expenditure.
OTM reveals that certain GPUs deliver significantly better performance-per-watt ratios for the same workload. The inference scheduler is updated to route requests dynamically toward the most efficient hardware, increasing effective throughput by roughly 12% and improving request latency without additional compute. At the usage level, insights from the OTM drive better prompt engineering and API usage.
Lower power draw during inference allows the system to scale down inactive GPUs during off-peak hours. When combined with the cloud provider’s PUE characteristics, this reduces quarterly emissions by approximately 125 kg CO₂e. The reduction results not from offsetting but from structural efficiency gains.
As AI becomes more deeply embedded in products, workflows, and infrastructure, organisations need transparent and consistent ways to understand the computational, economic, and environmental consequences of their AI usage. The One-Token Model responds to this need by grounding measurement in the most consequential unit of inference, tokens, and linking that unit directly to the hardware activity, and provider architectures that drive energy consumption and emissions.
This whitepaper represents an effort in consolidation. We have synthesized insights from fragmented research, benchmarks, and disparate methodologies into a unified, systematic framework. By combining hardware-aware estimation with a normalized representation of token-level compute, the OTM establishes a common analytical layer for evaluating AI workloads across open-source, hybrid, and proprietary environments. This gives organisations a clearer basis for decisions related to model procurement, budgeting, capacity planning, and sustainability reporting.
At Antarctica, our core value is bringing radical transparency and measurable value to every company deploying AI. We believe that sustainability, FinOps, and operational efficiency are not opposing goals but shared outcomes. Our mission with the OTM is to make Sustainable AI actionable inside organizations, ensuring that the growing body of research is not merely theoretical, but is implemented to drive tangible impact.
When deployed in production systems, the OTM supports operational optimisation: reducing cost, improving performance efficiency, and lowering emissions without constraining capability. The OTM is a living standard. We are already actively developing Version 1.1, which addresses the limitations identified in this paper to provide an even more granular view of inference. V1.1 will expand the energy envelope to strictly account for CPU, RAM, and networking overhead, and will fully integrate the Antarctica Token, our normalization layer designed to make cross-provider benchmarking consistent.
The need for reliable, cross-provider measurement standards will only grow as AI becomes more ubiquitous and heterogeneous. We are now moving from consolidation to application. The One-Token Model represents a step toward that standardisation by aligning operational clarity with environmental responsibility, offering a practical and scientifically grounded path to understanding how modern AI systems consume resources and deliver value.
Let’s talk tokens
Discover how smarter token usage can lower your AI costs and footprint.