
The One-Token Model

A unified framework for measuring the financial & environmental impact of AI inference.

Published on December 1st, 2025 by Antarctica Global Technology & Consulting Pvt. Ltd.


Introduction

01

Enterprise adoption of generative AI has expanded rapidly, with recent surveys indicating that approximately 90% of organizations have integrated AI into at least one workflow. Yet despite this widespread uptake, most enterprises remain confined to exploratory or pilot-stage implementations. This limitation is not due to inadequate model capability or technical skill, but to the absence of a standardized, rigorous framework for quantifying computational work. Today, organizations lack a dependable, foundational measurement framework for understanding the impact of their AI investments.

In the absence of such a foundational metric, organizations are unable to evaluate efficiency, characterize model behavior, or establish reliable relationships between usage patterns, cost structures, and associated energy or emissions impact. Without this measurement foundation, enterprises cannot build predictable budgets, scale workloads responsibly, or enforce AI governance with confidence.

This gap shows up in three distinct ways:

1. Measurement of Usage

While leading AI model providers price by API call, token, or GPU hour, there is no industry-wide, widely accepted standard that allows organizations to truly understand and compare the compute effort or resource use behind different jobs or workflows.

2. Rising Costs of AI Usage

As models become larger and more complex, and as backend architectures (servers, batch sizes, mixture-of-experts models) become more advanced, billing structures grow less transparent for the enterprise buyer. Organizations rarely receive detailed breakdowns of how their usage, prompt complexity, or model choice contributes to total compute cost. This makes budgeting unpredictable and optimization difficult.

3. The Environmental Impact of this Usage

AI’s energy use and carbon footprint are rarely transparent. Google’s disclosure that a median Gemini prompt uses 0.10 Wh and emits 0.02 gCO₂e is directionally useful. But a median value conceals the variability across prompts of different lengths, structures, and complexities, leaving organizations without insight into the full distribution of environmental impact.

The AI Lifecycle and The Role of a Token

02

Every meaningful action performed by an AI model today, whether understanding text, analyzing an image, interpreting audio, or generating a response, ultimately manifests as computation over tokens.

Tokens are the atomic units through which large language and multimodal models perceive, process, and produce information. They form the only universal unit that spans evaluation, inference, cost, hardware usage, and environmental impact.

AI Lifecycle Diagram

During evaluation, models are tested with structured prompts to measure accuracy, coherence, and task performance. These tests also reveal how many tokens a model must process to achieve a given level of quality. When translated into energy, or cost per token, evaluation benchmarks become multidimensional, allowing organizations to compare not just accuracy but energy and cost efficiency across model versions or configurations.

However, the real impact emerges in inference. Google reported processing 1.3 quadrillion tokens monthly in 2025, a scale so large that raw token counts become abstract. The way to resolve this ambiguity is by translating token volume into quantifiable cost, usage, and energy consumption, turning statistically overwhelming numbers into operationally relevant metrics.

Provider | Reported Token Volume (Monthly, 2025) | Notes / Source
Google (Gemini / DeepMind) | 1.3 quadrillion (1.3 × 10¹⁵) | Across all surfaces; doubled from 480 trillion in May to 980 trillion in July, reaching 1.3 quadrillion by summer.
OpenAI (API + ChatGPT) | >259 trillion (API only) | API at >6 billion tokens/min; total including ChatGPT estimated higher but not publicly detailed; 800 million weekly active users.
Microsoft (Azure OpenAI / Copilot) | 1.7 trillion (Foundry product) | Specific to Foundry; broader Copilot usage likely higher but no aggregate reported; quotas up to 32 billion for GPT-5 models.
Anthropic (Claude) | ~25 trillion (estimated) | 25 billion API calls in Q2, assuming ~1,000 tokens per call; 30 million monthly active users.

The Ubiquitous Token

03

Tokens are the model's internal representation of meaning. Just as humans rely on words, models rely on tokens: discrete, structured units that encode inputs, resolve context, and output information. Because all computation happens on tokens, they become the only unit that measures four critical dimensions:

  • Operational: Throughput (tokens/sec), latency per token
  • Economic: Pricing models are entirely token-based
  • Sustainability: Energy and emissions scale with token processing
  • Hardware Efficiency: Power draw per token reflects GPU and memory behavior

By late 2025, a growing body of practice and research places tokens at the centre of how AI is measured, priced, and optimised. Providers increasingly expose token-based limits, routing rules, and pricing tiers. New hardware generations such as Blackwell, MI300, and Gaudi make token-level behaviour far easier to observe through metrics like tokens per second, per watt, and per joule.

This direction is echoed in the Stanford AI Index 2025, which emphasises token-normalised benchmarks for comparing inference cost, efficiency, and carbon intensity. The report highlights a substantial reduction in inference costs since 2022, now commonly measured in token units, and encourages hybrid evaluation that pairs token usage with actual outputs. Complementary research on token efficiency, such as Token Length Control with Dynamic Rewards (TLDR)-style dynamic reward shaping, demonstrates that substantial reductions in token usage are possible without affecting accuracy, particularly for reasoning and maths-heavy tasks.

At the application layer, similar patterns appear in how enterprises design and operate AI products. Teams increasingly treat token usage as a KPI, budgeting and allocating costs in tokens across prompt engineering in RAG pipelines, agent orchestration, caching, and session management. Sparkco AI's 2025 analysis illustrates this shift, documenting 30–40% token reductions in real deployments through retrieval optimization, pruning, batching, and improved memory management. Emerging frameworks also assess the efficiency of tokenization itself.

These developments reflect a broader move toward understanding how much compute, cost, and energy each token represents.

Tokens in the Context of Providers

Model providers, across closed and open-source releases, price their APIs (OpenAI, Gemini, Anthropic, DeepSeek, and others) exclusively in terms of tokens, distinguishing between input tokens (what the user sends into the model) and output tokens (what the model generates), and often adding separate rates for cached/context tokens. Input, output, and context-extension tokens are differentiated because each carries a different computational footprint, from attention cost to KV-cache pressure, driving more granular cost models that include surcharges for long contexts and discounted rates for efficient batching and (in the future) low-carbon regions.

Real-time token telemetry is now standard: API users receive live token counts, burn-rate signals, and “token waste” diagnostics, enabling prompt optimization, throttling, and model switching. Token behaviour also shapes modern inference scheduling. Serving systems built around multi-model routing, mixtures of experts, and chip-aware orchestration use token-arrival rates and tokens-per-joule measurements to size batches, route workloads, and maintain SLAs across large fleets. Providers, under pressure from the community, have additionally begun to disclose energy-per-token and carbon-per-token metrics (see Google’s per-token emissions analysis).

In the open-source ecosystem, token-normalised benchmarks (tokens per task, latency per token, and energy per token) are now common across multilingual and multimodal evaluations (see Hugging Face’s AI Energy Score).

Together, these practices make tokens the provider’s operational reference point for pricing, routing, efficiency, and sustainability.

2025 Token Economics at a Glance

Model | Input Price (per 1M tokens) | Output Price (per 1M tokens)
– | $1.25 | $10.00
– | $3.00 | $15.00
– | $1.20 (prompts ≤ 200k tokens); $4.00 (prompts > 200k tokens) | $10.00 (prompts ≤ 200k tokens); $15.00 (prompts > 200k tokens)
– | $0.19–$0.49 (3:1 blended) | $0.19–$0.49 (3:1 blended)
– | $0.20 | $0.50
– | $0.028–$0.28 | $0.42
– | $0.40 | $2.00
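
Given per-million-token prices like those above, per-request cost is a direct function of token counts. The sketch below is a minimal illustration; the prices and token counts are hypothetical and do not reflect any specific provider's rate card.

```python
# A minimal sketch of token-based cost accounting; the prices and token counts
# below are hypothetical, not any specific provider's rate card.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of a single request priced per 1M tokens."""
    return (input_tokens / 1_000_000) * input_price_per_m + (
        output_tokens / 1_000_000
    ) * output_price_per_m

# Example: 500 input + 400 output tokens at $1.25 / $10.00 per 1M tokens.
print(round(request_cost(500, 400, 1.25, 10.00), 6))  # 0.004625 USD
```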

Tokens in the Context of Hardware

At an infrastructure level, every token processed by a model triggers real, measurable work on the accelerator, moving data through memory, running transformer blocks, hitting or missing caches, and drawing power. New GPU and accelerator stacks now expose per-token telemetry, reporting how much bandwidth, cache activity, heat, and power each segment of computation uses. Cloud and on-prem orchestration systems collect this data into live dashboards and sustainability reports, giving operators a detailed view of the physical cost of each token.

This level of visibility has reshaped tooling. OpenTelemetry extensions now treat tokens as first-class units, and FinOps teams combine cost, power, and workload metrics to calculate tokens-per-joule as well as cost- and carbon-per-prompt. These metrics feed internal dashboards, SLAs, and even customer billing. Green routing frameworks (for example, GreenPT’s green router) help choose the best model for each request, shifting workloads to cleaner regions or delaying inference when the grid is under stress.
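
As a minimal sketch of the two efficiency metrics just mentioned, the following uses assumed power, duration, and grid-intensity values; real deployments would feed these from live telemetry.

```python
# A minimal sketch (assumed numbers) of tokens-per-joule and carbon-per-prompt.

def tokens_per_joule(output_tokens: int, avg_power_watts: float, seconds: float) -> float:
    """Tokens generated per joule of energy drawn during the request."""
    energy_joules = avg_power_watts * seconds
    return output_tokens / energy_joules

def carbon_per_prompt(avg_power_watts: float, seconds: float,
                      grid_gco2e_per_kwh: float) -> float:
    """Grams of CO2e attributable to one prompt, from power draw and grid intensity."""
    energy_kwh = avg_power_watts * seconds / 3_600_000  # W·s -> kWh
    return energy_kwh * grid_gco2e_per_kwh

# Example: 400 output tokens, 350 W average draw, 8 s of generation, 400 gCO2e/kWh grid.
print(tokens_per_joule(400, 350, 8))    # ~0.14 tokens per joule
print(carbon_per_prompt(350, 8, 400))   # ~0.31 gCO2e per prompt
```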

Multimodal models add nuance: text tokens, image patches, and audio segments run through different paths, so comparisons often use normalised semantic units or composite efficiency scores to reflect equivalent work. At the same time, operators increasingly attribute part of the hardware’s lifecycle (Scope 3) emissions to inference, giving a fuller picture of carbon intensity. These capabilities are no longer limited to major hyperscalers. Edge devices and local GPUs now ship with SDKs that report tokens-per-joule or carbon-per-prompt directly to end users.

Tracking per-token metrics at the hardware layer enables operators to:

  • Measure the exact physical work associated with each token: memory movement, transformer execution, cache behaviour, and power draw.
  • Monitor real-time bandwidth use, thermal activity, and power spikes via per-token telemetry from modern GPU and accelerator stacks.
  • Calculate core efficiency metrics such as tokens-per-joule, carbon-per-prompt, and tokens-per-cooling-watt for operational and sustainability analysis.
  • Drive intelligent routing decisions by selecting chips or models based on per-token cost, latency, and carbon signals.
  • Shift workloads to lower-carbon regions or automatically pause inference during grid stress conditions.
  • Compare efficiency across modalities (text, image, audio) using normalised semantic units or composite efficiency scores.
  • Incorporate hardware lifecycle (Scope 3) emissions into per-token accounting for a fuller view of total carbon intensity.
  • Extend transparency to edge and local environments through SDKs that expose tokens-per-joule or carbon-per-prompt on consumer GPUs and devices.

Tokens in the Context of Users

In the context of tokens, “users” primarily refers to the teams, products, and organizations that consume model capacity through APIs or embedded AI workflows, not just individual end-consumers typing into a chat interface. 

Enterprise and developer users receive detailed token telemetry from providers, allowing them to allocate compute costs across departments or features, optimise prompts and RAG pipelines, and monitor energy and performance per API call. For SaaS builders and product managers, token consumption directly shapes the economics of their products, even if their customers only see high-level abstractions like “queries processed” or “documents analysed.”

Professional and power users, such as those relying on GitHub Copilot or AI productivity tools, sometimes interact with token limits indirectly through quotas or usage tiers. For them, “fewer tokens” can translate into staying within plan limits or achieving faster interactions, but not always into direct cost savings. In contrast, casual consumer users (like someone using ChatGPT or Copilot on a personal plan) rarely see tokens at all; their experience is governed by fixed envelopes, input caps, or rate limits rather than per-token economics or sustainability benefits. In enterprise environments, internal employees or client users typically never manage tokens directly, but enterprise IT and AI managers do, feeding token metrics into billing, reporting, optimisation, and environmental dashboards.

Across these contexts, tokens matter as an operational, economic, and sustainability metric only for the users who are responsible for or billed by their usage. For others, the effects are indirect: efficiency at the organisational level improves speed, reliability, and sustainability downstream.

To summarize, tracking per token metrics enables users to:

  • Allocate compute costs across departments, teams, features, or products with granular accuracy.
  • Monitor latency, throughput, and energy usage on a per-API-call basis.
  • Optimise prompts, RAG pipelines, agent flows, and caching to reduce unnecessary token generation.
  • Compare models, chips, and routing decisions using normalised efficiency metrics (e.g., tokens-per-joule, carbon-per-prompt).
  • Manage quotas, usage tiers, and plan limits for professional or power-user scenarios.
  • Reveal sustainability and cost insights in enterprise dashboards, green SLAs, or internal reporting.
  • Translate token-level behaviour into product-level decisions for SaaS builders and application providers.

Deep-Dive into a Token

04

Across modalities, whether the AI model is reading text, analyzing an image, or listening to audio, it always breaks the input into small, understandable pieces called tokens.

Tokenization Tree Diagram

Text Tokenization

Text is decomposed into small units: characters, sub-words, or words, through tokenizers such as OpenAI’s tiktoken, which approximates one token as ¾ of an English word. The model processes these tokens sequentially and produces new tokens one by one.

Text Tokenization Process Diagram
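
As a minimal sketch (assuming OpenAI's tiktoken library is installed), the example below encodes a sentence with the cl100k_base encoding and shows how words map to token IDs and back to text pieces.

```python
# A minimal sketch of text tokenization with OpenAI's tiktoken library
# (pip install tiktoken). The cl100k_base encoding is used for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens are the atomic units of LLM computation."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print(token_ids[:8])                              # integer token IDs
print([enc.decode([t]) for t in token_ids[:8]])   # the corresponding text pieces
```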

Audio Tokenization

Digitally, an audio signal is described as a continuous waveform of sound pressure over time. Audio tokenization transforms continuous sound waves into discrete representations, or tokens, that sequence models can interpret.

Some techniques include:

1. Phoneme/Character Tokens (Automatic Speech Recognition): converts spoken language into text by transforming audio signals into discrete tokens such as phonemes, characters, or words.

Speech Recognition Process Diagram

2. Codec Tokens (Neural Audio Codecs): codecs such as Meta’s EnCodec and Google’s SoundStream turn audio into sequences of tokens without compromising quality, using vector quantization.

These representations align audio with the same discrete, token-based processing used for text.

Codec Tokens

Image Generation

In a digital context, an image is described as a collection of pixels. Image generation models, however, do not see an image as pixels but as structured information broken into smaller, meaningful representations. These compact representations are called tokens, and they help the model understand patterns, textures, and semantics. Popular approaches include:

  • Patch Embeddings: They split an image into uniform, non-overlapping patches, each represented as a token. These tokens act like words in a sentence, allowing the model to process visual information as a structured sequence of discrete units. For a 224×224 image and a 16×16 patch size, this gives (224/16)² = 196 patches; each patch becomes one token, and together these 196 tokens describe the entire image, much like words describe meaning in a sentence.

Patch Embeddings Diagram

  • Discrete Variational Auto-Encoder (DVAE) and Vector Quantization: Think of vector quantization as turning image features into predefined buckets. Each bucket stores a representative vector, and the DVAE maps image parts to these buckets to learn consistent, discrete patterns.

  • CLIP-Style Contrastive Embeddings: Contrastive models like CLIP learn to align images and text in the same feature space. Each image or caption is converted into an embedding, and these embeddings can then be grouped or discretized into token-like units for other tasks.

Video Generation

The most common way to tokenize videos today is by breaking them into individual frames and pairing them with the corresponding audio. Each frame acts like an image token, while the audio provides temporal context. 

For example, models like Gemini process video as a sequence of image tokens interwoven with text and audio information.

Patch Embeddings Diagram

To summarize, tokens are the currency of AI. Whenever a company embeds AI into an application, whether for document analysis, summarization, search, chat, content generation, or media processing, every user interaction becomes a sequence of tokens in and tokens out.

What is Underlying the One-Token Model?

05

The One-Token Model (OTM) quantifies the energy consumed by an LLM during inference and expresses its environmental impact on a per-token basis. It tells you the energy consumption of your token usage, and its consequent emissions. 

The OTM works by observing the computation required to process each token, whether text, audio, image, or multimodal, and translating that computation into energy (kWh) and carbon emissions (gCO₂e). Because every interaction with an LLM ultimately reduces to tokens in and tokens out, the token becomes the most precise and universally comparable unit for evaluating the sustainability of AI usage. Frontier models such as those from OpenAI, Anthropic, Google, and Meta already operate internally using tokenization frameworks.

One Token Model

Although the word “token” appears at multiple layers of the AI stack, the One-Token Model unifies them into a coherent structure for measuring energy.

Category | What They Are | Where Used | Example | Role in the OTM
Model/Text Tokens (Core Level) | Smallest units processed or produced by LLMs | Inside GPT, Claude, Gemini, Llama | "Hello world!" → ["Hello", " world", "!"] | Primary unit for computing energy and CO₂ per token
API Billing Tokens | Units used by providers to meter usage | API dashboards, invoices | 2M tokens billed | Connects cost, usage, and energy
End-User Tokens | Hidden tokens exchanged during user interactions | Chat interfaces, enterprise apps | A 200-word prompt → 300 input + 400 output tokens | Enables per-user emissions reporting
Hardware-Level Tokens | Token-equivalent compute units derived from GPU telemetry | GPUs, TPUs, accelerators | 1,000 tokens → measurable watt-seconds | Links logical tokens to real-time power and carbon intensity
Training Tokens | Tokens used during model training | AI labs, lifecycle analysis | 1 trillion training tokens | Represents embodied emissions outside inference

Example 1: Model/Text Tokens


When a user chats with ChatGPT, their message is broken into tokens that form the prompt. The model then generates a response using N tokens, which remain invisible to the user. The energy used for this interaction is measured and estimated.

ChatGPT Screenshot

Example 2: API/Billing Tokens

Developers using APIs from popular LLMs, as well as AI-powered coding IDEs, can view token-usage metrics, including the number of tokens exchanged (input, output, and total) per API call or interaction, through their respective dashboards or API responses.

A developer using OpenAI or Anthropic’s API receives telemetry showing:

500 input tokens + 400 output tokens = 900 tokens total.

ChatGPT Screenshot
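
As a minimal sketch (assuming the OpenAI Python SDK, v1+, and an API key set in the environment), the per-call usage block can be read directly from the response object; other providers expose similar usage fields, and the model name here is only an example.

```python
# A minimal sketch of reading per-call token telemetry from the OpenAI Python SDK
# (openai >= 1.0 assumed); other providers expose similar usage fields.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Summarize the One-Token Model in two sentences."}],
)

usage = response.usage
print("input tokens: ", usage.prompt_tokens)
print("output tokens:", usage.completion_tokens)
print("total tokens: ", usage.total_tokens)
```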

The One-Token Model estimates the environmental impact of inference by quantifying the energy required to generate each token.

Example 3: End-User Tokens

A user writes a 200-word prompt in an enterprise chatbot.
Behind the scenes, the system logs:

300 input tokens processed + 400 output tokens generated.
Although the user never sees these tokens, the organisation can attribute emissions back to this specific user or workflow through OTM-based reporting.

Example 4: Hardware-Level Tokens

A data center runs an inference server where 1,000 logical tokens for a single request translate to a measurable watt-second profile on the GPU (power spikes, memory movement, cache hits). OTM converts these GPU telemetry traces into carbon intensity per token, enabling real-time hardware-aware emissions estimates.

Example 5: Training Tokens

A model developer reports that a new LLM was trained on 1 trillion tokens. OTM treats these as part of the model’s embodied emissions, distributing the training carbon footprint across all future inference tokens to reflect lifecycle impact, something end users and enterprises can now see in per-token calculations.

Overview of the Methodology

Inference impact is measured by observing how a model exercises compute resources during token generation. The model starts at the hardware layer, where energy is actually consumed, and progressively incorporates model-level and system-level factors to calculate the energy per token and the corresponding carbon emissions.

The model builds on a shared body of empirical evidence rather than any single source. It aligns with the direction of prior work such as EcoLogits, methodologies developed by Salesforce and Hugging Face, and independent research, while extending these ideas into a unified, systematic framework designed for practical implementation. 

Where existing studies provide measurement techniques, benchmarks, or partial approaches, the OTM consolidates these insights into an end-to-end method that connects hardware activity to model-level inference behavior and, ultimately, to energy and emissions attribution.

The One-Token Model separates the inference footprint into three interacting domains:

  1. Hardware Utilization (GPUs and non-GPU server components)
  2. Model Architecture (parameter count, active parameters, quantization)
  3. Inference Dynamics (latency, throughput, output tokens generated)

This multi-layered structure ensures that energy estimates reflect both physical resource usage and model-specific behavior.

One Token Model

Hardware Measurement Layer

GPUs dominate the power consumption of LLM inference.

  1. Installed GPUs is the total number of GPUs that are provisioned.
  2. Active GPUs are the GPUs required to host and execute the model during inference.
  3. GPU Power is the instantaneous power draw of each GPU during token generation.
  4. GPU Utilization is the fraction of computational capacity used during inference. Utilization acts as a scaling factor on power draw and inference time.
  5. GPU Memory Usage (GB) is the VRAM required to load active model parameters, activations, and intermediate tensors.

While GPUs typically dominate power usage during LLM inference, the rest of the server (CPU, memory, networking, cooling fans, etc.), i.e., the non-GPU components, also consumes power. To represent this non-GPU overhead, the model includes a baseline server power component. A default value of 1.2 kW is used, as it aligns with measurements reported across several sources, including BoaviztaAPI specifications for inference-optimized systems such as AWS p5-class and comparable hyperscaler GPU nodes.

This value should not be interpreted as a universal constant; rather, it serves as a practical, evidence-supported approximation for high-performance inference servers. Where deployment-specific telemetry is available, particularly in on-premise or tightly controlled environments, this baseline can be adjusted to reflect measured operating conditions.

During inference, this baseline is treated as fully attributable to the workload. Thus, the energy consumed by non-GPU components is estimated as:

Server Energy (kWh) = Inference Time (hours) × 1.2 kW

This baseline accounts for idle power overhead and networking activity associated with serving the request.

Model Architecture Layer

The LLM architecture is one of the major factors that determine how many GPUs are required and how much memory the model occupies during inference.

The model parameters describe the internal structure and configuration of the language model itself.

  • Total Parameters (B): The total number of trainable parameters in the model, measured in billions. This represents the model’s theoretical size.
  • Active Parameters (B): The subset of parameters actually used during inference. Architectures such as Mixture-of-Experts activate only a fraction of total parameters per token, affecting both memory usage and power draw.

We also include a Quantization Factor (Q) to account for the reduced numerical precision of the model weights. Quantization (e.g., FP16 or INT8 instead of FP32) lowers the precision of the weights and therefore significantly reduces the memory needed to store model parameters, allowing more parameters to fit in the same GPU memory. In production LLM deployments, quantization levels (e.g., FP16, INT8) are generally fixed for a model family. The quantization factor follows the AI Energy Score methodology from Hugging Face.

Active GPUs Used During Inference estimates how many GPUs are activated during inference, primarily based on the total number of model parameters, quantization, and available GPU memory. For open-source models, these values can be captured in real time. However, because the quantization factor or precision is not publicly known for closed models such as ChatGPT and Claude, we add an overhead:

Memory_model (GB) = Memory_LLM Inference (GB) × Overhead

The memory required for inference is:

Memory_LLM Inference (GB) = [Total Parameters (B) × Bytes Per Parameter] / (32 / Q)

Once we know the model’s memory footprint under the chosen precision, we can determine how many GPUs are needed to host the model during inference. Modern large models often cannot fit into a single GPU’s memory, so they are sharded across multiple GPUs. We calculate the number of active GPUs required as follows:

Active GPU(s) = Memory_model / Memory_GPU
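
A minimal sketch of these two sizing formulas is shown below. It assumes Q is the quantization bit width (16 for FP16, 8 for INT8), a 4-byte FP32 baseline per parameter, an illustrative 1.2× memory overhead, and rounding up to whole GPUs when sharding; these values are assumptions for illustration, not fixed constants of the OTM.

```python
# A minimal sketch of the memory and active-GPU formulas above, assuming Q is the
# quantization bit width (e.g., 16 for FP16, 8 for INT8), a 4-byte (FP32) baseline
# per parameter, and an illustrative memory overhead; rounding up the GPU count is
# an assumption for sharding across whole devices.
import math

def inference_memory_gb(total_params_b: float, q_bits: int, bytes_per_param: float = 4.0) -> float:
    """Memory_LLM Inference (GB) = [Total Parameters (B) x Bytes Per Parameter] / (32 / Q)."""
    return (total_params_b * bytes_per_param) / (32 / q_bits)

def active_gpus(total_params_b: float, q_bits: int, gpu_memory_gb: float,
                overhead: float = 1.2) -> int:
    """Active GPU(s) = Memory_model / Memory_GPU, with Memory_model = inference memory x overhead."""
    memory_model = inference_memory_gb(total_params_b, q_bits) * overhead
    return math.ceil(memory_model / gpu_memory_gb)

# Example: a 70B-parameter model served in FP16 on 80 GB GPUs.
print(inference_memory_gb(70, 16))   # 140.0 GB before overhead
print(active_gpus(70, 16, 80))       # 3 GPUs once the 1.2x overhead is applied
```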

Inference Dynamics Layer

The methodology incorporates three core latency measures:

  1. Time to First Output Token: The latency or delay before the model produces its first word (token) after receiving input. Lower is better for responsiveness.
  2. Output Tokens per Second: The rate at which the model generates output tokens. It measures the throughput or generation speed.
  3. Inference Time: The total time taken for the model to process input and produce output (complete response).

These metrics determine the time window over which energy consumption is integrated.

Output Token Count

The number of generated tokens directly affects the final per-token emissions. This is the denominator in the One-Token-Model calculation.

The energy consumption can be estimated using the formula:

Energy (kWh) = [Server Energy (kWh) + Active GPUs × GPU Power (kW) × GPU Utilization Rate × Generation Latency (hours)] × Power Usage Effectiveness (PUE)

PUE (Power Usage Effectiveness) adjusts for cooling and infrastructure overhead.

Carbon Emissions Calculation

Carbon emissions are obtained by multiplying energy consumption by grid carbon intensity:

CO₂e Total = Energy (kWh) × Grid Carbon Intensity (gCO₂e/kWh)

Finally, emissions are normalized by token count:

CO₂e Per Token = CO₂e Total / Output Tokens
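
Putting the pieces together, the sketch below computes energy and per-token emissions from the formulas above; the GPU power, utilization, latency, PUE, and grid-intensity inputs are assumed values used purely for illustration.

```python
# A minimal sketch of the OTM energy and emissions formulas, with assumed inputs;
# the 1.2 kW server baseline follows the default discussed in this section.

def energy_kwh(active_gpus: int, gpu_power_kw: float, gpu_utilization: float,
               latency_hours: float, pue: float, server_baseline_kw: float = 1.2) -> float:
    """Energy (kWh) = [Server Energy + Active GPUs x GPU Power x Utilization x Latency] x PUE."""
    server_energy = server_baseline_kw * latency_hours
    gpu_energy = active_gpus * gpu_power_kw * gpu_utilization * latency_hours
    return (server_energy + gpu_energy) * pue

def co2e_per_token(energy_kwh_total: float, grid_gco2e_per_kwh: float, output_tokens: int) -> float:
    """CO2e Per Token = (Energy x Grid Carbon Intensity) / Output Tokens."""
    total_gco2e = energy_kwh_total * grid_gco2e_per_kwh
    return total_gco2e / output_tokens

# Example: 3 GPUs at 0.7 kW, 60% utilization, 10 s of generation, PUE 1.2,
# 400 gCO2e/kWh grid, 400 output tokens.
e = energy_kwh(active_gpus=3, gpu_power_kw=0.7, gpu_utilization=0.6,
               latency_hours=10 / 3600, pue=1.2)
print(round(e, 5))                            # ~0.0082 kWh
print(round(co2e_per_token(e, 400, 400), 5))  # ~0.0082 gCO2e per output token
```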

To summarize, the OTM is powered by data collected at three levels:

Provider Level | Hardware Level | Usage Level
Total Parameters | Number of GPUs | Number of tokens
Active Parameters | Active GPUs | Time to First Token (reasoning period)
Quantization | GPU Utilization | Generation latency (inference time)
PUE | GPU Power Draw |
Batching | Server Power |
Caching | Memory Overhead |
 | Network Overhead |

Limitations

1. On Models

Assumption: For proprietary models we estimate parameter counts based on performance heuristics and community reverse-engineering. We assume a standard quantization level (typically FP16 or INT8) where not disclosed.

Limitation: This method may introduce margins of error when providers update backend architectures. Additionally, OTM V1.0 focuses on text-to-text generation and does not yet account for the specific overhead of multimodal inputs (images/audio).

2. On Hardware

Assumption: The model currently isolates GPU power consumption, as GPUs represent the dominant energy sink during inference. Non-GPU components (CPU, RAM, networking) are aggregated into a fixed baseline server power overhead (defaulting to 1.2 kW based on hyperscaler specifications). We assume a minimum idle GPU utilization of 10%; in practice, GPU utilization cannot be 0% unless the device is fully powered off.

Limitation: This creates a generic baseline that may not perfectly reflect custom on-premise hardware or alternative accelerators (like TPUs).

We are actively developing OTM Version 1.1 to address the limitations identified above. This next iteration expands the model's granularity and introduces a novel normalization layer.

  1. Expanding the Energy Envelope. V1.1 moves beyond the fixed server baseline to calculate a granular summation of all components. Experimental data suggests CPU and RAM usage can add ~30% to GPU energy use. This will also include networking overhead derived from independent research.
  2. We also aim to improve the modelling of GPU power draw at different utilisation rates. We are exploring a method to capture how power scales with utilisation and TDP. In practice, GPUs typically draw only about 75–80% of their rated TDP under real full-load conditions. This can be mitigated for open-source LLMs hosted on on-prem servers, where precise power measurements are possible.
  3. A core challenge in V1.0 is that a "token" is not a standard unit; different providers (OpenAI, Anthropic, Google) use different tokenizers, making "per-token" comparisons inexact. Different modalities process tokens differently. To solve this, we are introducing the Antarctica Token (see the next section).

Normalizing Tokens For the OTM

06

Inference Phases and Why OTM Focuses only on Output Tokens

Every LLM inference request unfolds in two distinct computational phases:

  1. Prefill (Input Processing): When you submit a prompt, the model first processes all input tokens together in a single forward pass. During this phase, the model builds an internal representation of your request by populating what's called the KV (Key-Value) cache. This phase determines how long you wait before seeing the first word of the response: the "Time to First Token" (TTFT).
  2. Generation (Output Tokens): After prefill completes, the model begins generating output tokens one at a time. Unlike prefill, each output token requires a complete autoregressive pass through the entire model. The model must reference everything it has processed so far, apply its full reasoning capacity, and produce exactly one new token. Then it repeats this process for the next token, and the next, until the response is complete.

Empirical work, including EcoLogits, ML.ENERGY, the AI Energy Score, and independent studies, consistently shows that:

  1. For many real-world workloads the decode phase (reasoning plus answer generation) often accounts for the clear majority of inference compute, with compute dominated by how long the model ‘thinks’ and how many output tokens it produces.
  2. Prefill dominates only in very short outputs.
  3. For many real-world LLM workloads with non-trivial answers, total compute tends to correlate more strongly with the length of the generated output than with the length of the input, although very long prompts or very short outputs can shift the balance toward input-side prefill.

For this reason, V1.0 of the OTM attributes emissions primarily to output tokens, because they represent the section of inference where compute, and therefore energy, is concentrated. Focusing on output tokens provides an immediate, practical benefit: OTM can be applied to any single model without requiring cross-provider normalization. Within any individual model, output tokens provide a stable, consistent anchor for measurement because the tokenizer, architecture, and generation pathway all operate under unified internal logic.

Acknowledging the Limitation

We recognize that for practical measurements, many published methodologies sum these roles and report "energy per token" over the union of input and output tokens, which creates a more hardware- and architecture-agnostic metric.

Attribution to output tokens alone, as in our OTM v1.0 approach, is a pragmatic simplification, but it may undercount scenarios where large prompts (many input tokens) significantly affect energy use; this often matters for instruction-following or RAG scenarios. Counting only output tokens makes metrics portable across providers and models but can miss model-design differences that affect prefill.

There is a requirement for a consistent measurement unit that faithfully represents computational work. Output tokens are the starting point for energy attribution because they dominate inference compute. But as models evolve, input tokens, modality-specific processing tokens, and internal reasoning tokens will also need to be incorporated into a more complete accounting. 
Modern AI usage rarely stays confined to one model or one provider. Organizations evaluate multiple options. Developers compare costs across OpenAI, Anthropic, and Google. Applications route requests dynamically based on workload characteristics.

The Path Forward

For comprehensive cross-provider analysis, cost optimization, and fair performance benchmarking, the full spectrum of token types, together with the multimodal differences across images, video, audio, and providers, must be reconciled into a common measurement framework.

A Structural Challenge in Enterprise Tokenomics

OTM is designed to be applied independently for each model, provider, and modality. This makes OTM immediately usable in practice. If a developer wants to understand the energy footprint of a prompt executed on GPT-5.1, OTM can measure it. If the same prompt is run on Anthropic Claude Opus, the model can be applied again, producing a separate, model-specific estimate. Each result stands on its own, in the same way that evaluation metrics such as latency, accuracy, or cost are typically compared today: model by model, test by test.

The OTM sometimes uses a simplifying assumption; for instance, for text prompts, approximately four characters per token. This assumption is useful in contexts where exact token counts are not yet available, for example because you are not running your own tokenizer, or because each model requires its own tokenizer and different measurements for images, video, and text.

The OTM in its current version allows teams to reason about orders of magnitude without needing access to provider-specific tokenization. However, this is explicitly an approximation. In any serious deployment, the OTM is intended to operate on the actual token counts reported by each model or inferred from its tokenizer, rather than relying on a fixed 4:1 rule. The approximation can therefore be viewed as a pedagogical bridge, not as a core constraint of the model.

But organizations don’t operate in single-model environments. They evaluate options, compare providers, mix modalities, and make purchasing decisions that require apples-to-apples comparisons. Evaluating cross-provider behavior still requires normalized tokens across input tokens, behind-the-scenes processing tokens, and output tokens. Teams can apply the OTM repeatedly across providers and modalities, building a comparable view over time, model by model, use case by use case.

A core challenge in measuring AI inference impact is that a token is not a universal unit of computation across providers. Each AI provider, OpenAI, Anthropic, Google, Meta, Mistral, and others, uses its own tokenizer, vocabulary, and segmentation logic. As a result, identical text, image, audio, or video inputs can yield significantly different token counts across models, even when the underlying computational workload is similar.

Provider (Tokenizer) | Approximate Characters per Token
OpenAI (tiktoken) | ~4
Anthropic | ~3.5
Google (SentencePiece) | ~3.8
Meta (Llama tokenizer) | ~4
Mistral (custom BPE) | ~3.2
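
The sketch below shows why identical text yields different token counts across providers, using only the approximate characters-per-token ratios from the table above; real counts must come from each provider's own tokenizer, so these estimates are illustrative.

```python
# A minimal sketch of cross-provider token-count divergence, using the approximate
# characters-per-token ratios above; real counts come from each provider's tokenizer.

CHARS_PER_TOKEN = {
    "OpenAI (tiktoken)": 4.0,
    "Anthropic": 3.5,
    "Google (SentencePiece)": 3.8,
    "Meta (Llama tokenizer)": 4.0,
    "Mistral (custom BPE)": 3.2,
}

def estimated_tokens(text: str, chars_per_token: float) -> int:
    """Rough token estimate from character length and a provider-specific ratio."""
    return round(len(text) / chars_per_token)

prompt = "Analyze this earnings call transcript and produce a one-page summary. " * 20
for provider, ratio in CHARS_PER_TOKEN.items():
    print(f"{provider:25s} ~{estimated_tokens(prompt, ratio)} tokens")
```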

Tokenizers differ in vocabulary size, subword construction, merging rules, and handling of whitespace, punctuation, and code. As a direct consequence:

1 OpenAI token ≠ 1 Anthropic token ≠ Google token ≠ Nth provider token

Four problems arise:

  1. Semantic Non-Equivalence: The same input text produces different token counts, even though the model is performing comparable semantic work.
  2. Computational Non-Equivalence: Providers encode different amounts of computation per token. For example, some models allocate more attention or memory per token due to architectural choices.
  3. Modern frontier models operate beyond text:
    • Images: patch tokens, DVAE tokens, VQ tokens
    • Audio: codec tokens (EnCodec, SoundStream) or phoneme tokens
    • Video: tokens per frame, per patch, plus fused audio tokens
    • Long reasoning: hidden intermediate tokens
    • Tool use: additional model-internal tokens
  4. Each modality has its own encoding logic, so the computational weight of a "token" varies widely across
    • Model families
    • Modality types
    • Reasoning vs non-reasoning paths
    • Provider-specific middleware

In practice, real enterprise workloads increasingly chain providers ("Use Claude for summarization, GPT-4 for structured extraction"), mix modalities ("Analyze this video and generate a written report"), and route through different execution strategies (small-to-large model cascades, MoE expert selection, reasoning-mode toggles). Current cost, usage, and energy frameworks cannot reconcile these heterogeneous token types into a common unit. Comparisons become unreliable for tasks like "Analyze this earnings call video and produce a summary," because you cannot meaningfully compare 15,000 video tokens + 500 text tokens in one system against 12,000 unified tokens in another.

To summarize, the OTM is designed around a simple principle: inference compute is overwhelmingly driven by output-generation steps, and output tokens provide a stable anchor for attributing energy and emissions. Within any individual model, this anchor is consistent because the tokenizer, architecture, and decoding pathway operate under a unified internal logic.
But current cost, usage, and energy frameworks do not reconcile these heterogeneous token types into a common unit. This lack of a standardized token unit is a foundational gap in cross-provider sustainability, benchmarking, and cost analysis.

The Antarctica Token

07

The Antarctica Token (AT) serves as this additional normalization layer.

It is defined as a normalized unit of LLM computational work, independent of provider tokenizer differences.

An Antarctica Token represents the standardized computational effort required to process or generate one unit of semantic content, normalized across tokenization methods, model architectures, languages, and modalities.

The Antarctica Token provides:

1. Provider-Agnostic Equivalence

Effective normalization requires conversion mappings for every major model provider and architecture pattern. The Antarctica Token framework maintains the most extensive database of provider-to-normalized token conversions currently available. This database includes:

  • All major closed-source providers: OpenAI, Google, Anthropic, Meta, xAI, and others.
  • Leading open-source model families: Mistral, DeepSeek, and others.

2. Computational Grounding

The AT Framework works on a database built through systematic empirical testing rather than only theoretical assumptions. For each provider and model, the framework measures:

  1. Full-Cycle Token Accounting tracks total throughput from ingestion to generation, ensuring complete visibility into cost and usage across the entire lifecycle.
  2. Architectural Resource Profiling analyzes underlying model characteristics and computational weight to optimize performance allocation without manual tuning.
  3. Adaptive Compute Pathways differentiates between standard processing and complex logic flows, routing requests efficiently based on required cognitive load.
  4. Unified Multimodal Abstraction standardizes the consumption and generation of diverse media types into a single, cohesive accounting layer, regardless of format.
  5. Semantic Density Evaluation assesses the information richness of the payload to adjust processing expectations based on the complexity of the content.
  6. Comparative Infrastructure Benchmarking provides contextual performance data against market standards to validate efficiency and cost-effectiveness.
Antarctica Token API

3. Continuous Database Expansion

AI providers release new models and tokenizer updates continuously. Maintaining accurate normalization requires ongoing measurement and database updates. The Antarctica framework incorporates:

  1. Automated monitoring of new model releases
  2. Rapid characterization of new tokenizers and architectures
  3. Backward compatibility maintenance as older models are deprecated
  4. Quality assurance through cross-validation against known workloads, and multiple data quality checks.

This operational infrastructure, the continuous process of measuring, validating, and updating conversion mappings, ensures the Antarctica Token remains a stable currency in a volatile ecosystem.

A normalized standard is only as valuable as its accessibility to the systems that need it. The theoretical rigor of the One-Token Model must be translated into actionable telemetry within real-world IT environments. These pathways allow organizations to ingest normalized metrics directly into their existing stacks, and apply the OTM seamlessly, regardless of whether they control the hardware or rely on third-party vendors. This brings us to the three architectural models for deployment.

API Integrations for the One-Token Model

08

The One-Token Model (OTM) can be deployed in different architectural environments depending on the level of access available to system telemetry and model internals. While the model achieves its highest accuracy when implemented directly at the provider or infrastructure layer, it remains effective in third-party API scenarios through a combination of statistical modelling, public hardware data, and latency observation. This section outlines three integration architectures and the data pathways through which OTM measurements are produced.

The API integration can be deployed in three types of architecture:

Architecture 1: Provider-Side (full access to the resource)
Architecture 2: Third-Party API (limited access to the resource)
Architecture 3: Hybrid

Architecture 1:

Provider-Side Integration (Full Telemetry Access)

In this configuration, the model is executed on infrastructure that the organisation directly controls, whether on a cloud GPU instance, on-premise servers, or self-hosted open-source LLM deployments (e.g., Llama, Mistral, Falcon, or any model served via frameworks such as vLLM or Hugging Face Inference Endpoints). Because the organisation manages both the model and the underlying compute, OTM can access complete hardware telemetry and derive high-resolution measurements.

Such cases are common among companies that begin adding AI/ML modules by leveraging open-source LLMs and fine-tuning them to their business requirements.

Architecture
  1. Request Capture: When a user request arrives, the system records the timestamp, model identifier, input size, and initial system state. This establishes the beginning of the inference interval.
  2. Pre-Inference Setup: Before decoding begins, the system activates GPU monitoring and records baseline power, active GPU count, and relevant model configuration (e.g., quantization level, parameter sharding).
  3. Real-Time Inference Monitoring: During token generation, the system samples GPU utilisation, power draw, active parameters, memory footprint, and token-level timing. These metrics form the core evidence of computational activity.
  4. Post-Inference Aggregation: After the final output token is produced, the system aggregates telemetry into energy (kWh), carbon emissions (gCO₂e), and per-token impact measures.
  5. Response with Metadata: The application returns the model’s output together with an attached metadata summary containing token usage, energy consumption, emissions, inference duration, and relevant hardware context (a sketch of such a payload follows this list).
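
The sketch below illustrates what that metadata summary might look like; the field names and values are hypothetical, not a fixed schema.

```python
# A minimal sketch of the metadata summary returned in step 5; field names and
# values are illustrative, not a fixed schema.
from dataclasses import dataclass, asdict

@dataclass
class InferenceFootprint:
    model_id: str
    input_tokens: int
    output_tokens: int
    inference_seconds: float
    energy_kwh: float
    emissions_gco2e: float
    active_gpus: int
    gpu_model: str

footprint = InferenceFootprint(
    model_id="llama-3-70b-instruct",   # hypothetical identifier
    input_tokens=512,
    output_tokens=384,
    inference_seconds=9.4,
    energy_kwh=0.0081,
    emissions_gco2e=3.2,
    active_gpus=2,
    gpu_model="NVIDIA H100 80GB",
)

response = {"output": "…model response text…", "otm_metadata": asdict(footprint)}
print(response["otm_metadata"])
```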

Architecture 2:

Third-Party API Integration (Limited Telemetry)

This is the setup most commonly encountered, both professionally and personally. In scenarios where models are accessed through closed-model APIs (e.g., OpenAI, Anthropic, Google), detailed hardware metrics are not exposed. Precise measurements of hardware utilization, active parameters, latency, time to first token, and other metrics are hidden from the end user; for measurement purposes, the AI becomes a black box.

OTM in this case, operates by combining publicly available model specifications, academic and industry research, observed inference timings, and statistical approximations derived from reference hardware profiles.

Architecture
  1. Model Reference Database: A database of model characteristics is maintained, including parameter counts, typical hardware deployments, active/total parameter ratios, GPU types, average utilisation, cloud-region PUE values, and other published specifications. This acts as a profile library for inference energy estimation.
  2. Timestamp Capture: During an API call, three timestamps are recorded: request start, time to first token, and time to final token. These capture reasoning duration, generation time, and total inference interval.
  3. Output Token Extraction: Output token counts are obtained directly from the API or estimated using an approximation aligned with the provider’s known tokenization method (e.g., ~4 characters per token for tiktoken) or if using the Antarctica Token, then real-time token analysis is conducted seamlessly.
  4. Energy and Carbon Estimation: Using the reference database and measured latency, the model’s likely hardware profile is combined with PUE and grid intensity data to estimate energy consumption and resulting emissions.
  5. Metadata Assembly: The system returns token usage, estimated energy and carbon impacts, and the assumptions used in producing the estimate, ensuring transparency and reproducibility (a minimal estimation sketch follows this list).
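
The following is a minimal sketch of this flow under stated assumptions: the reference profile values, timings, and grid intensity are illustrative placeholders, not measurements for any real provider.

```python
# A minimal sketch of the Architecture 2 flow: a reference profile (assumed,
# illustrative values) combined with observed latency to estimate energy and
# emissions for a closed-model API call.
import time

REFERENCE_PROFILES = {
    # Hypothetical profile; real entries come from the model reference database (step 1).
    "closed-model-x": {
        "active_gpus": 8,
        "gpu_power_kw": 0.7,
        "gpu_utilization": 0.5,
        "server_baseline_kw": 1.2,
        "pue": 1.2,
        "grid_gco2e_per_kwh": 400,
    }
}

def estimate_api_call(model: str, start: float, first_token: float, last_token: float,
                      output_tokens: int) -> dict:
    p = REFERENCE_PROFILES[model]
    latency_hours = (last_token - start) / 3600
    gpu_kwh = p["active_gpus"] * p["gpu_power_kw"] * p["gpu_utilization"] * latency_hours
    server_kwh = p["server_baseline_kw"] * latency_hours
    energy = (gpu_kwh + server_kwh) * p["pue"]
    return {
        "time_to_first_token_s": first_token - start,
        "output_tokens": output_tokens,
        "energy_kwh": energy,
        "emissions_gco2e": energy * p["grid_gco2e_per_kwh"],
        "assumptions": "reference hardware profile, default PUE and grid intensity",
    }

t0 = time.monotonic()
# ... issue the API request here; record first-token and final-token timestamps ...
t_first, t_last = t0 + 0.8, t0 + 6.5   # stand-in timings for illustration
print(estimate_api_call("closed-model-x", t0, t_first, t_last, output_tokens=400))
```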

Architecture 3:

Hybrid Integration (Self-Hosted with Monitoring Extensions)

The architecture resembles Architecture 1. Here, the LLM is hosted on on-premise servers. For example: A client running Llama-3 on their own servers with Antarctica integration.

Architecture

Here’s a four-step approach to implementation:

1. Deploy Antarctica Container as a Sidecar

Antarctica runs alongside your model as a sidecar container in your Kubernetes deployment, effectively acting as a "co-pilot" that shares the same compute context as the main inference container (e.g., vLLM or TGI) without interfering with the model’s critical path or latency.

2. Enable Real-Time Hardware Telemetry

Once deployed, the sidecar begins collecting live GPU telemetry. The sidecar interfaces directly with the host’s hardware drivers. Unlike external API estimates, this allows for the collection of ground-truth data. It continuously streams:

  • Instantaneous Power Draw: The exact wattage consumed by the specific GPUs assigned to the pod.
  • GPU Utilization: The precise compute load during the inference window.
  • Memory Bandwidth & VRAM: Real-time memory pressure metrics.

3. Integrate Model Tracking

The Antarctica system connects directly to your model serving process. Wrap the model generation function, as sketched after the list below, to populate metrics such as:

  • Number of tokens generated
  • Power draw during generation
  • Time taken to reason and answer
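
A minimal sketch of such a wrapper is shown below. It assumes NVIDIA GPUs with the pynvml bindings installed, and `model_generate` is a placeholder for the actual serving call (e.g., a vLLM or TGI invocation); a production sidecar would sample power continuously on a background thread rather than twice per request.

```python
# A minimal sketch of wrapping a model's generate function to capture the metrics
# listed above. pynvml (NVIDIA management library bindings) is assumed installed;
# `model_generate` stands in for the actual serving call.
import time
import pynvml

pynvml.nvmlInit()
_handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def tracked_generate(model_generate, prompt: str) -> dict:
    power_samples_w = []
    start = time.monotonic()

    # Sample power before generation; a production sidecar samples continuously.
    power_samples_w.append(pynvml.nvmlDeviceGetPowerUsage(_handle) / 1000)

    output_text, output_tokens = model_generate(prompt)   # placeholder serving call

    power_samples_w.append(pynvml.nvmlDeviceGetPowerUsage(_handle) / 1000)
    elapsed = time.monotonic() - start

    return {
        "output": output_text,
        "output_tokens": output_tokens,
        "generation_seconds": elapsed,
        "avg_power_watts": sum(power_samples_w) / len(power_samples_w),
    }
```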

4. Response With Metadata

Finally, the system answers with a structured output. Because the hardware is monitored directly, the output contains actuals rather than estimates.

  • Token usage
  • Energy and carbon estimates
  • The only assumptions are at the provider level

Applications of the One-Token Model

09

The One-Token Model can be implemented at multiple layers of an AI system depending on the visibility available into usage, hardware, and model behavior. Broadly, OTM can be applied in three domains: usage-level analytics, hardware-level monitoring, and provider-level benchmarking.

Together, these domains allow OTM to support individual users, enterprises, and system operators in evaluating the computational, economic, and environmental implications of AI workloads.

Application of OTM

Case 1: Usage

At the usage level, OTM quantifies how much computational work is performed in response to user interactions and translates this into energy and carbon metrics.

Individual Interactions - AIWattch

For individual users interacting with a chat-based LLM, OTM can measure the per-prompt impact of an inference event using output tokens as the primary unit. Tools built on top of the methodology (such as lightweight instrumentation layers or prompt-side extensions) can help users understand:

  1. How their query structure affects token generation
  2. Whether a prompt is producing unnecessary work
  3. How efficiency varies across different types of tasks (short-form, long-form, reasoning, or multimodal)

This enables end-users to make informed decisions about how they use AI systems.

AI Wattch

Enterprise Usage

In organisational contexts, thousands of interactions accumulate into substantial computational footprints. When deployed within enterprise systems, OTM can aggregate usage across teams and roles to provide:

  1. Per-employee and per-department impact
  2. Cumulative token and energy measurements
  3. High-level summaries of environmental impact across the organisation.
My Activity

This supports internal observability into employee AI usage, compliance requirements, and responsible-use governance.

Case 2: Hardware

OTM connects usage to physical activity when hardware telemetry is available. Using GPU and server-level metrics, the model translates observed compute activity into:

Hardware

This produces a direct mapping between computational work and environmental footprint, and gives a clear tokens/CO₂ measurement in real time across GPUs.

Case 3: Provider

Because API pricing is defined per token and each provider tokenizes differently, OTM enables consistent cross-model comparisons. By linking token generation to cost and computational effort, OTM allows providers and consumers to evaluate:

  • Pricing per million tokens
  • Effective throughput once normalized across tokenizers
  • Response latency and reasoning time
  • Differences between input and output tokenization
  • Energy and carbon per output token under comparable workloads
  • Tokenization styles and resulting cost and energy implication
  • Relative efficiency between models
  • Relationship between model architecture and compute intensity

OTM helps you compare AI models on a consistent, real-time basis, so any claim of being a more sustainable AI provider can be validated in real time with the OTM.

Optimization Using the One-Token Model

10

The application of the One-Token Model across usage, hardware, and provider layers enables organisations not only to measure impact but to translate those measurements into operational, economic, and environmental improvements. By exposing how efficiently tokens are produced and used, whether in API-driven applications or self-hosted deployments, OTM provides the observability needed to guide optimization strategies. These strategies typically centre on three outcomes: cost reduction, performance efficiency, and emissions minimization.

Optimization

Case Study: Improving Inference Efficiency in an Enterprise Deployment

A mid-sized organisation integrates an LLM-based assistant into its internal analytics platform. After an initial period of adoption, the engineering team observes that inference-related cloud costs and GPU activity are increasing at a rate disproportionate to the growth in user queries. To understand the source of the discrepancy, the organisation deploys the One-Token Model with the AT API to monitor how much computational work is being performed per token and how efficiently that work is converted into user-visible responses.

Establishing a Baseline

Using OTM instrumentation and the AT API, the team captures real-time metrics such as:

  • Throughput (tokens per second)
  • Latency across the inference interval
  • GPU utilization and power draw

These measurements reveal the energy cost per token and highlight variations in efficiency across workloads and times of day. This baseline becomes the reference point for targeted interventions.

Targeted Optimization

With a clear view of the computation associated with each token, the organisation implements improvements along three dimensions:

1. Cost Reduction

Analysis shows that the system consumes approximately 0.002 kWh per output token. By adjusting model configurations, introducing modest batching during peak periods, and refining prompt structures to reduce unnecessary generation, the team reduces this to 0.0015 kWh per token. The improvement translates into a 25% reduction in monthly GPU-related energy expenditure.

2. Operational Efficiency

OTM reveals that certain GPUs deliver significantly better performance-per-watt ratios for the same workload. The inference scheduler is updated to route requests dynamically toward the most efficient hardware, increasing effective throughput by roughly 12% and improving request latency without additional compute. At the usage level, insights provided by the OTM drive better prompt engineering and API usage.

3. Emissions Reduction

Lower power draw during inference allows the system to scale down inactive GPUs during off-peak hours. When combined with the cloud provider’s PUE characteristics, this reduces quarterly emissions by approximately 125 kg CO₂e. The reduction results not from offsetting but from structural efficiency gains.

Conclusion

11

As AI becomes more deeply embedded in products, workflows, and infrastructure, organisations need transparent and consistent ways to understand the computational, economic, and environmental consequences of their AI usage. The One-Token Model responds to this need by grounding measurement in the most consequential unit of inference, the token, and linking that unit directly to the hardware activity and provider architectures that drive energy consumption and emissions.

This whitepaper represents an effort in consolidation. We have synthesized insights from fragmented research, benchmarks, and disparate methodologies into a unified, systematic framework. By combining hardware-aware estimation with a normalized representation of token-level compute, the OTM establishes a common analytical layer for evaluating AI workloads across open-source, hybrid, and proprietary environments. This gives organisations a clearer basis for decisions related to model procurement, budgeting, capacity planning, and sustainability reporting.

At Antarctica, our core value is bringing radical transparency and measurable value to every company deploying AI. We believe that sustainability, FinOps, and operational efficiency are not opposing goals but shared outcomes. Our mission with the OTM is to make Sustainable AI actionable inside organizations, ensuring that the growing body of research is not merely theoretical, but is implemented to drive tangible impact.

When deployed in production systems, the OTM supports operational optimisation: reducing cost, improving performance efficiency, and lowering emissions without constraining capability. The OTM is a living standard. We are already actively developing Version 1.1, which addresses the limitations identified in this paper to provide an even more granular view of inference. V1.1 will expand the energy envelope to strictly account for CPU, RAM, and networking overhead, and will fully integrate the Antarctica Token, our normalization layer designed to make cross-provider benchmarking consistent.

The need for reliable, cross-provider measurement standards will only grow as AI becomes more ubiquitous and heterogeneous. We are now moving from consolidation to application. The One-Token Model represents a step toward that standardisation by aligning operational clarity with environmental responsibility, offering a practical and scientifically grounded path to understanding how modern AI systems consume resources and deliver value.

Sources

12
  1. Rincé, Samuel and Banse, Adrien (2025). EcoLogits: Evaluating the Environmental Impacts of Generative AI. Journal of Open Source Software, 10(111), 7471. DOI: 10.21105/joss.07471
  2. The state of AI in 2025: Agents, innovation, and transformation
  3. Measuring the environmental impact of AI inference
  4. Gemini Enterprise: The new front door for Google AI in your workplace
  5. NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks
  6. Testing AMD's Giant MI300X
  7. Model Performance Data for Intel® Gaudi® 3 AI Accelerators
  8. The 2025 AI Index Report
  9. Understanding and Improving Token-Efficiency of Reasoning Models
  10. Measuring the environmental impact of delivering AI at Google Scale
  11. Hugging Face: AI Energy Score
  12. Semantic conventions for generative AI metrics
  13. GreenPT: Green Router
  14. The Real Carbon Cost of an AI Token
  15. What are tokens and how to count them?
  16. OpenAI's Tiktoken
  17. EnCodec: High-fidelity Neural Audio Compression
  18. SoundStream: An End-to-End Neural Audio Codec
  19. Environmental Impacts of LLM Inference
  20. Salesforce Joins Technology and Academic Leaders to Unveil AI Energy Score Measuring AI Model Efficiency
  21. Hugging Face: AI Energy Score Leaderboard
  22. How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
  23. BoaviztaAPI: An API to access Boavizta's methodologies and data
  24. The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
  25. Artificial Analysis: Independent analysis of AI
  26. Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node
  27. How useful is GPU manufacturer TDP for estimating AI workload energy?
  28. Power consumption of a specific NVIDIA data center GPU

Let’s talk tokens

Discover how smarter token usage can lower your AI costs and footprint.