Making Tokens, Pt. 5: The Forward Pass

The final piece of the Making Tokens series. We have built the chain. Sand to wafer to chip to cluster. The chain ends here, with the user pressing enter and the model producing the next word.

What "one token" actually is

A token is a unit of text the model operates on. It is not a word. The tokenizers used by most large models break text into subword units, so common words ("the", "and") are single tokens while uncommon words ("photolithography") might be split into three or four.

Useful rules of thumb:

1 token ≈ 4 English characters ≈ 0.75 words
A 500-word page is about 667 tokens
A typical conversational chat response is 200-1000 tokens
A long-form essay this length is around 3,000 tokens
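If you want these rules of thumb in a form you can call, here is a minimal sketch. The function names are mine, and the ratios are the approximations above, not the behavior of any real tokenizer.

```python
# Rule-of-thumb token estimates for English text.
# These are approximations only; a real tokenizer will give different counts.
CHARS_PER_TOKEN = 4.0    # 1 token is roughly 4 English characters
WORDS_PER_TOKEN = 0.75   # 1 token is roughly 0.75 words


def tokens_from_words(word_count: int) -> int:
    """Estimate a token count from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def tokens_from_text(text: str) -> int:
    """Estimate a token count from raw text using the character rule."""
    return round(len(text) / CHARS_PER_TOKEN)


print(tokens_from_words(500))  # 667
```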

When a model "produces a token," what is happening at the silicon level is one forward pass through the trained neural network. Inputs go in (your prompt, plus all tokens generated so far), one new token comes out, then the whole thing repeats with that token appended to the input. The autoregressive loop is the dominant computational pattern of modern transformers.
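The loop itself is short enough to sketch. This is an illustration of the pattern, not any particular library's API: `forward_pass` stands in for the real network, and `toy_forward_pass` is a made-up placeholder so the example runs.

```python
from typing import Callable, List


def generate(prompt_tokens: List[int],
             forward_pass: Callable[[List[int]], int],
             max_new_tokens: int,
             eos_token: int) -> List[int]:
    """Autoregressive decoding: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = forward_pass(tokens)  # full context in, one token out
        tokens.append(next_token)          # the new token becomes input
        if next_token == eos_token:        # stop at end-of-sequence
            break
    return tokens


def toy_forward_pass(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model: returns the last token plus one."""
    return tokens[-1] + 1


print(generate([1, 2, 3], toy_forward_pass, max_new_tokens=10, eos_token=6))
# [1, 2, 3, 4, 5, 6]
```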

The arithmetic

For a transformer of N parameters, generating one token requires roughly 2N floating-point operations. The factor of 2 is the standard back-of-envelope: one multiply and one add per parameter, per token, for a dense model. The arithmetic gets more nuanced for mixture-of-experts (MoE) models and for prompt processing (which is more parallelizable than generation), but the 2N rule is good enough for most purposes.

For some concrete numbers:

70B-parameter dense model: ~140 GFLOPs per token (2 × 70 × 10⁹)
400B-parameter dense model: ~800 GFLOPs per token
MoE model with 8 experts but 2 active per token: roughly the active-parameter compute, so a "1T total, 100B active" MoE is about 200 GFLOPs per token
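The same numbers fall out of a few lines of code. This is a sketch of the 2N rule only; the function name is mine, and for the MoE case I pass the active parameter count rather than the total.

```python
def flops_per_token(active_params: float) -> float:
    """2N rule: one multiply and one add per active parameter, per token."""
    return 2.0 * active_params


models = {
    "70B dense": 70e9,
    "400B dense": 400e9,
    "1T total / 100B active MoE": 100e9,  # only active parameters count
}

for name, active in models.items():
    print(f"{name}: ~{flops_per_token(active) / 1e9:.0f} GFLOPs per token")
```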

An H100 SXM has a theoretical peak of about 989 TFLOP/s of dense FP16 compute (more with FP8, less with FP32). In practice, for autoregressive decoding (one token at a time, batch size 1), you cannot saturate the chip because the workload is memory-bandwidth-bound, not compute-bound: every generated token requires reading all of the model's weights from memory, and that read takes longer than the arithmetic. You can move much closer to peak when you batch multiple users together, because one pass over the weights then serves every request in the batch, which is why every production inference service is doing aggressive request batching under the hood.
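To see why batch size 1 is memory-bound, a rough roofline-style estimate helps. Treat this as a sketch under loud assumptions: the 3.35 TB/s memory bandwidth is a figure I am adding for the H100 SXM, the weights are assumed to be stored at 1 byte per parameter, and KV-cache traffic and every other overhead is ignored. The function name is illustrative.

```python
# Assumed hardware figures for an H100 SXM.
PEAK_FLOPS = 989e12      # FP16 FLOP/s, from the text
MEM_BANDWIDTH = 3.35e12  # bytes/s, an assumption added for this sketch


def decode_ceilings(params: float, bytes_per_param: float, batch_size: int):
    """Return (compute_bound, memory_bound) ceilings in aggregate tokens/sec.

    Simplification: each decode step reads every weight once, and that read
    is shared by the whole batch. KV-cache reads are ignored.
    """
    compute_bound = PEAK_FLOPS / (2.0 * params)             # 2N FLOPs per token
    steps_per_sec = MEM_BANDWIDTH / (params * bytes_per_param)
    memory_bound = steps_per_sec * batch_size               # one token per request per step
    return compute_bound, memory_bound


for batch in (1, 32, 256):
    compute, memory = decode_ceilings(70e9, bytes_per_param=1.0, batch_size=batch)
    bound = "memory" if memory < compute else "compute"
    print(f"batch {batch:>3}: ceiling ~{min(compute, memory):,.0f} tok/s ({bound}-bound)")
```

Under these assumptions the ceiling is set by memory bandwidth at small batch sizes and only reaches the compute limit once a few hundred requests share each pass over the weights.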

A typical production deployment of a 70B-class model on a single H100 (which means quantizing the weights to 8 bits or fewer so they fit in the card's 80 GB of memory), with reasonable batching, delivers somewhere in the range of 3,000-8,000 tokens per second of aggregate throughput across all concurrent users. Per individual user, you typically see 50-150 tokens per second of streaming output, depending on the load and the model.

The energy bill, per token

If we want to know what a single token actually costs in energy terms, the math is:

An H100 draws ~700 W under load
It produces somewhere around 5,000 tok/sec at batch
That's 5,000 tok/sec ÷ 700 W = 7.1 tokens per joule
Equivalently, ~0.14 J per token, or ~0.04 milliwatt-hours per token

For a chat response of 500 tokens:

~70 J of GPU energy = roughly 0.02 watt-hours
Including facility overhead (PUE ~1.3), call it ~0.025 Wh per response
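The per-token and per-response arithmetic above can be reproduced in a few lines. This is a sketch using only the inputs already stated (700 W, 5,000 tok/sec, PUE 1.3); the function names are mine.

```python
GPU_POWER_W = 700.0        # H100 draw under load
THROUGHPUT_TOK_S = 5000.0  # aggregate tokens/sec at batch
PUE = 1.3                  # facility overhead multiplier
JOULES_PER_WH = 3600.0


def joules_per_token(power_w: float = GPU_POWER_W,
                     tok_per_s: float = THROUGHPUT_TOK_S) -> float:
    """Watts are joules per second, so W / (tok/s) = J per token."""
    return power_w / tok_per_s


def response_wh(tokens: int, pue: float = 1.0) -> float:
    """Energy for one response in watt-hours, optionally scaled by PUE."""
    return tokens * joules_per_token() * pue / JOULES_PER_WH


print(f"{1 / joules_per_token():.1f} tokens per joule")
print(f"{joules_per_token():.2f} J per token")
print(f"{response_wh(500):.3f} Wh per 500-token response (GPU only)")
print(f"{response_wh(500, PUE):.3f} Wh including facility overhead")
```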

A Google search is often cited at around 0.3 Wh per query. So a typical AI chat response is, surprisingly to many people, roughly an order of magnitude cheaper in energy than a Google search, on a per-query basis. The popular framing of "AI uses so much energy" is mostly a story about the upstream training (where the energy cost is real and concentrated) and the aggregate scale (trillions of tokens per day), not the per-query cost.

The aggregate scale, of course, is the part that adds up. If a single AI service handles a billion queries per day at 500 tokens each, that's 500 billion tokens per day. At ~0.025 Wh per response, that is about 25 MWh per day, or ~9 GWh per year of inference electricity for one service. Multiply across the industry and you get to the headline numbers about hyperscaler power demand growing several-fold this decade.
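Scaling the per-response figure up to a year is one multiplication, sketched here. The function name is mine, and the inputs are the billion queries per day and the ~0.025 Wh per response from earlier in this section.

```python
def annual_inference_gwh(queries_per_day: float, wh_per_query: float) -> float:
    """Yearly inference electricity in GWh for a given daily query volume."""
    wh_per_year = queries_per_day * wh_per_query * 365
    return wh_per_year / 1e9  # 1 GWh = 1e9 Wh


daily_mwh = 1e9 * 0.025 / 1e6
print(f"{daily_mwh:.0f} MWh per day")
print(f"{annual_inference_gwh(1e9, 0.025):.1f} GWh per year")
```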

What a token costs in dollars

The user-facing price of a token in 2026 spans more than two orders of magnitude depending on the model:

| Model class | Input $/Mtok | Output $/Mtok |
|---|---|---|
| Cheapest commodity (Gemini Flash, gpt-4o-mini) | $0.15-0.30 | $0.60-2.50 |
| Frontier mainstream (Claude Sonnet, gpt-4o, Gemini Pro) | $1-3 | $5-15 |
| Reasoning models (o1, o3) | $1-15 | $4-60 |

The reasoning models are expensive specifically because they emit a lot of invisible "thinking" tokens that the user doesn't see but is billed for. A single visible answer from a reasoning model might consume 20-100x as many tokens internally as the user sees.

Output tokens are universally more expensive than input tokens. The reason is that input tokens can be processed in parallel (the prompt processing step is compute-bound and runs efficiently), while output tokens have to be generated one at a time in the autoregressive loop, which is memory-bandwidth-bound and harder to optimize. The 3-5x output multiplier you see on every pricing page is real and structural.
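To make the pricing concrete, here is a sketch of what one request costs. The function and its `hidden_token_multiplier` parameter are hypothetical names of mine: the multiplier models the invisible "thinking" tokens by scaling the billed output. The prices are picked from the ranges in the table, not from any specific pricing page.

```python
def request_cost_usd(input_tokens: int,
                     visible_output_tokens: int,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float,
                     hidden_token_multiplier: float = 1.0) -> float:
    """Dollar cost of one request at per-million-token prices.

    hidden_token_multiplier scales billed output tokens relative to the
    visible ones (1.0 means no hidden reasoning tokens).
    """
    billed_output = visible_output_tokens * hidden_token_multiplier
    total = input_tokens * input_price_per_mtok + billed_output * output_price_per_mtok
    return total / 1e6


# Frontier mainstream model at $3 in / $15 out, no hidden tokens.
print(f"${request_cost_usd(1000, 500, 3.0, 15.0):.4f}")
# Reasoning model at $15 in / $60 out, billing 20x the visible output.
print(f"${request_cost_usd(1000, 500, 15.0, 60.0, hidden_token_multiplier=20):.4f}")
```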

Margin is hard to estimate from the outside. Cheapest commodity models are likely close to cost. Frontier mainstream models are probably running at healthy gross margins. The big variable is idle GPU time: a model with low utilization is much less profitable per active token than one running at saturation.

Back to the whole chain

Let me close by putting all of this in one place. The full supply chain that produces a single token, with rough order-of-magnitude numbers per token:

| Stage | Per-token rough cost |
|---|---|
| Silicon (polysilicon, wafer, fab depreciation share) | ~$10⁻⁹ |
| Datacenter capex amortization (chip + facility) | ~$10⁻⁶ |
| Training amortization (across all queries the model serves) | ~$10⁻⁸ |
| Direct inference electricity | ~$10⁻⁸ |
| Total marginal cost (rough) | ~$10⁻⁶ per output token |

The user-facing price of $5-15 per million output tokens for frontier models is therefore roughly 5 × 10⁻⁶ to 1.5 × 10⁻⁵ dollars per token, which is within about an order of magnitude of our marginal cost estimate. That's not a coincidence: it tells you the gross margin in this business is not as fat as people sometimes assume, especially for the frontier models that justify the largest training investments.
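Adding up the table makes it obvious which line dominates. This is a sketch using the order-of-magnitude entries from the table as if they were exact, which they are not; the variable and function names are mine.

```python
# Order-of-magnitude per-token costs from the table, in dollars.
COST_PER_TOKEN_USD = {
    "silicon": 1e-9,
    "datacenter capex amortization": 1e-6,
    "training amortization": 1e-8,
    "direct inference electricity": 1e-8,
}

total = sum(COST_PER_TOKEN_USD.values())
for stage, cost in COST_PER_TOKEN_USD.items():
    print(f"{stage:<32} {cost:.0e}  ({cost / total:.1%} of total)")
print(f"{'total':<32} {total:.2e}")


def price_per_token(usd_per_mtok: float) -> float:
    """Convert a $/Mtok list price to dollars per token."""
    return usd_per_mtok / 1e6


# Frontier output prices from the pricing table, compared with the total.
for usd_per_mtok in (5.0, 15.0):
    ratio = price_per_token(usd_per_mtok) / total
    print(f"${usd_per_mtok:.0f}/Mtok is ~{ratio:.0f}x the estimated marginal cost")
```

On these rough figures, capex amortization is nearly the entire per-token cost, and everything else is rounding error.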

The point of all of this

The reason I wrote this series is that the AI discourse is mostly about the model. It is rarely about the supply chain. And the supply chain is far more constrained, far more capital-intensive, and far more geographically concentrated than the public conversation reflects.

When you generate a token, you are touching:

A specific mine in North Carolina or Brazil
One of four wafer companies in Japan, Germany, and Taiwan
One of three foundries that can run leading-edge processes
One company in the Netherlands that makes the lithography machines
One of three or four datacenter operators large enough to deploy frontier clusters
A power grid somewhere with enough spare capacity to support a small city's worth of new demand
A water table somewhere with enough margin to support the cooling

Every single token routes through that whole chain. Each one costs a tiny share of all of it. The cost is amortized across so many tokens that each individual one feels free.

Building well in the AI era means understanding where each layer of that chain bends, breaks, or surprises. If you got something out of this series and want to talk about that work, I am around.