mirza.town - end of the typewriter

16/06/2026

Jumpscare

\[ H(X) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i) \]

where p(x_i) is the probability the model assigns to the i-th candidate token at a given position, and N is the number of candidate tokens (the vocabulary size). But first…

End of the Typewriter

A year and a half ago, I sent this message to my team lead:

“Transformers generate 1 token at a time, right? What if they generated N tokens simultaneously like a diffusion model, and then refined them—wouldn’t that produce better results?”

Fast forward to this week, and Google has officially validated that exact thought experiment with the release of DiffusionGemma. It turns out the idea wasn’t just a late-night delusion. (Though I am still waiting on my royalty check in the mail.)

How does DiffusionGemma work?

Imagine you have a blank page in front of you. You start writing, and from the get-go, you know the introduction, the body, and the conclusion. You vaguely know the general structure of the essay. First, you draft, then you refine.

For years, the most powerful language models in the world have basically been hyper-advanced typewriters. They generate text autoregressively, predicting the next word one token at a time, strictly sequentially. This differs profoundly from how a human writes (or at least the thinking behind the human writing process). Writing with a standard LLM is like writing an essay in permanent marker: you can’t go back and fix your introduction once you realize where your conclusion naturally landed.

DiffusionGemma solves this by taking a canvas of 256 tokens of absolute gibberish and evaluating the entire chunk at once. It iteratively sculpts the whole block into a coherent paragraph simultaneously (one step at a time, of course). In essence, it can fix the beginning based on the end, and adjust the middle based on the surrounding context using bidirectional attention.
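To make the typewriter contrast concrete, here is a minimal sketch of the two attention patterns. The function names and the toy block size are my own illustrative choices, not anything from DiffusionGemma's code. A causal mask lets position i look only at positions up to i, while a bidirectional mask lets every canvas position look at every other one.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


n = 4  # toy block size; the canvas in this post is 256 tokens
for name, mask in (("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))):
    print(name)
    for row in mask:
        # "x" marks an allowed attention edge, "." a blocked one
        print(" ".join("x" if allowed else "." for allowed in row))
```

The causal grid comes out as a lower triangle and the bidirectional grid is fully filled. The filled grid is what lets a late token influence an early one.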

So if the starting context was: Today it will rain, so I...

and we want to generate a short block of tokens, our initial (first step) output and the confidence scores might look like this:

Position	Token	Confidence
1	[fridge]	0.04
2	[stay]	0.82
3	[guitar]	0.02
4	[clown]	0.01
5	[read]	0.15
6	[rose]	0.02
7	[cloud]	0.04
8	[.]	0.10
9	[<EOS>]¹	0.65
10	[algorithm]	0.01

Even though it completely whiffed on position 1 with [fridge], the bidirectional attention found a solid anchor at position 2, realizing [stay] logically belongs somewhere right after the prompt. Interestingly, it also found a high-confidence guess at position 9 ([<EOS>]), sensing the sentence should end soon.

Even though [guitar] was a terrible guess for position 3, the model still generated a probability distribution for all the words in its vocabulary for that slot. Maybe “inside” only had a 5% chance in step 1, but that is still higher than the 0.0001% chance it gave to “xylophone.”

Hey, where was I?

The clever part is this: if we throw away the confidence scores for the uncertain positions, we basically waste crucial information. Instead, DiffusionGemma uses self-conditioning. It takes the full probability distribution from the previous step, converts it into a weighted average of all token embeddings, and adds it—through a small gated MLP—onto the canvas embeddings before the next pass.

This gives each step a “memory” of what the model believed last time. So, even positions that get re-noised to random tokens carry forward information from the previous step rather than having to start from scratch.
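Here is a minimal sketch of that idea for a single position, in plain Python. Everything concrete in it is an assumption for illustration: the three-token vocabulary, the 2-dimensional embeddings, and above all the single scalar sigmoid gate, which stands in for the small gated MLP (I don't know the real layer's shape).

```python
import math


def soft_embedding(probs, embedding_table):
    """Probability-weighted average of token embeddings for one position."""
    dim = len(embedding_table[0])
    out = [0.0] * dim
    for p, emb in zip(probs, embedding_table):
        for d in range(dim):
            out[d] += p * emb[d]
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def self_condition(canvas_emb, soft_emb, gate_logit):
    """Add the previous step's belief onto the canvas embedding.

    ASSUMPTION: one scalar sigmoid gate replaces the small gated MLP.
    """
    g = sigmoid(gate_logit)
    return [c + g * s for c, s in zip(canvas_emb, soft_emb)]


# Hypothetical vocabulary of 3 tokens with 2-dimensional embeddings.
embedding_table = [
    [1.0, 0.0],    # "inside"
    [0.0, 1.0],    # "guitar"
    [-1.0, -1.0],  # "xylophone"
]
prev_probs = [0.7, 0.2, 0.1]  # previous step's distribution for this position
canvas_emb = [0.0, 1.0]       # the position was re-noised to "guitar"

soft = soft_embedding(prev_probs, embedding_table)
conditioned = self_condition(canvas_emb, soft, gate_logit=0.0)
print("soft embedding:", soft)
print("conditioned canvas embedding:", conditioned)
```

Even though the canvas token is "guitar", the conditioned vector is nudged toward "inside", because that is where most of the previous step's probability mass sat.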

Simultaneously, DiffusionGemma uses an entropy-bound rule. It walks through positions from most confident to least, accepting tokens until their accumulated entropy exceeds a fixed budget.

Position	Token	Confidence	Status
1	[fridge]	0.04	Re-noised
2	[stay]	0.82	Locked
3	[inside]	0.45	Re-noised
4	[and]	0.30	Re-noised
5	[read]	0.25	Re-noised
6	[a]	0.20	Re-noised
7	[book]	0.15	Re-noised
8	[.]	0.40	Re-noised
9	[<EOS>]	0.65	Locked
10	[<PAD>]²	0.05	Re-noised

Now we are getting somewhere. The model is highly confident about “stay” and its End-of-Sequence ([<EOS>]) marker. These two high-confidence tokens fit within the budget, so they become anchors. The rest of the tokens—including the bizarre guess of [fridge] at position 1—are still too uncertain, so they get re-noised once more.
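The entropy-bound rule can be sketched in a few lines. This is my reading of the rule as described above, not DiffusionGemma's actual sampler. The budget value, the four-token vocabulary, and the function names are all hypothetical. I also assume the most confident position is always accepted, so every step locks at least one token, and that the walk stops just before the budget would be exceeded.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of one position's distribution over tokens."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def select_positions(distributions, budget_bits):
    """Return the positions to lock under an entropy budget.

    Positions are visited from lowest entropy (most confident) to highest and
    accepted while the running entropy total stays within the budget.
    ASSUMPTION: the first position in that order is always accepted.
    """
    entropies = [entropy_bits(d) for d in distributions]
    order = sorted(range(len(distributions)), key=lambda i: entropies[i])
    accepted, total = [], 0.0
    for i in order:
        if accepted and total + entropies[i] > budget_bits:
            break  # everything after this is even less certain
        accepted.append(i)
        total += entropies[i]
    return sorted(accepted)


# Hypothetical 4-token vocabulary and three canvas positions.
distributions = [
    [0.25, 0.25, 0.25, 0.25],  # no idea at all: 2 bits of entropy
    [0.97, 0.01, 0.01, 0.01],  # very confident
    [0.90, 0.05, 0.03, 0.02],  # fairly confident
]
print(select_positions(distributions, budget_bits=1.0))
```

With a 1-bit budget the two confident positions (indices 1 and 2) get locked and the uniform one is left to be re-noised. Note that the sketch ranks by entropy of the full distribution, which is related to but not the same thing as the single confidence number in the tables.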

After a few more steps:

Position	Token	Confidence	Status
1	[will]	0.99	New Lock
2	[stay]	0.98	Locked
3	[inside]	0.95	New Lock
4	[and]	0.92	New Lock
5	[read]	0.88	New Lock
6	[a]	0.85	New Lock
7	[book]	0.80	New Lock
8	[.]	0.85	New Lock
9	[<EOS>]	0.99	Locked

The model finally replaced the nonsense at position 1 with [will], bringing the whole thought together: “Today it will rain, so I will stay inside and read a book.” The argmax predictions have stabilized, and the mean entropy is incredibly low. The model has reached convergence.
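A stopping check along those lines could look like the sketch below. The two conditions come from the paragraph above (stable argmax, low mean entropy). The 0.1-bit threshold and the per-position entropy values are numbers I made up for illustration.

```python
def has_converged(prev_argmax, curr_argmax, entropies_bits, max_mean_entropy=0.1):
    """True when predictions stopped changing and mean entropy is small.

    ASSUMPTION: the 0.1-bit default threshold is an illustrative value.
    """
    stable = prev_argmax == curr_argmax
    mean_entropy = sum(entropies_bits) / len(entropies_bits)
    return stable and mean_entropy <= max_mean_entropy


prev = ["will", "stay", "inside", "and", "read", "a", "book", ".", "<EOS>"]
curr = list(prev)  # nothing changed between the last two steps
entropies = [0.05] * len(curr)  # hypothetical per-position entropies in bits

print(has_converged(prev, curr, entropies))
```

If any position still flips between steps, or the distributions are still spread out, the check fails and the loop would run another denoising pass.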

Cost: Sequential vs. Parallel

Measured in forward passes, the worst case for a diffusion model is the same as for a sequential autoregressive model. For 256 tokens, a standard model does 256 forward passes. If the diffusion model could lock in only 1 good token per step, it would also take 256 steps. But because the model propagates context bidirectionally and locks in multiple confident tokens per step, the number of required passes drops dramatically.
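The pass counting is simple enough to write down. This sketch assumes a constant number of positions lock per step, which is a simplification of mine. Real runs lock a varying number each step.

```python
import math


def autoregressive_passes(num_tokens):
    """One forward pass per generated token."""
    return num_tokens


def diffusion_passes(canvas_size, locked_per_step):
    """Denoising steps needed if a fixed number of positions lock per step.

    ASSUMPTION: a constant lock rate across all steps.
    """
    return math.ceil(canvas_size / locked_per_step)


canvas = 256
print("autoregressive:", autoregressive_passes(canvas))
for k in (1, 4, 16):
    print(f"diffusion, {k} locked per step:", diffusion_passes(canvas, k))
```

At 1 lock per step the two approaches tie at 256 passes. At 4 per step the diffusion model needs 64, and at 16 per step only 16. Passes are not equal in cost, though, since each diffusion pass covers the whole canvas.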

Each denoising step runs the Gemma backbone³ in decoder mode over the full canvas, samples a candidate token at every position, and decides which positions to keep.

Because all 256 positions are evaluated in parallel, generating 1 token or 256 tokens takes essentially the same amount of time per step. It effectively trades memory bandwidth pressure for additional compute—which is an incredibly smart tradeoff at low batch sizes where compute is plentiful.

This is fundamentally different from Multi-Token Prediction (MTP), which the standard Gemma 4 can use. According to the vLLM blog post, at batch size 1, the FP8 DiffusionGemma model reaches an absurd 1,288 generation tokens per second on an H200. That is 6 times faster than a standard autoregressive Gemma 4 baseline, and 3 times faster than one using MTP⁴ mode.
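For a feel of what those ratios imply, here is some back-of-the-envelope arithmetic. The only inputs are the figures quoted above. The implied baseline and MTP rates are my own derivation, assuming "N times faster" means a plain throughput ratio. They are not numbers reported anywhere.

```python
DIFFUSION_TOKENS_PER_SECOND = 1288  # figure quoted above
SPEEDUP_VS_BASELINE = 6             # quoted ratio
SPEEDUP_VS_MTP = 3                  # quoted ratio


def implied_rate(fast_rate, speedup):
    """Throughput of the slower system, assuming speedup is a plain ratio."""
    return fast_rate / speedup


def seconds_for(num_tokens, tokens_per_second):
    """Wall-clock time to generate num_tokens at a steady rate."""
    return num_tokens / tokens_per_second


baseline = implied_rate(DIFFUSION_TOKENS_PER_SECOND, SPEEDUP_VS_BASELINE)
mtp = implied_rate(DIFFUSION_TOKENS_PER_SECOND, SPEEDUP_VS_MTP)

print(f"implied baseline: {baseline:.0f} tokens/s")
print(f"implied MTP:      {mtp:.0f} tokens/s")
print(f"one 256-token canvas: {seconds_for(256, DIFFUSION_TOKENS_PER_SECOND):.2f} s")
```

That works out to roughly 215 and 429 tokens per second for the two autoregressive setups, and about 0.2 seconds for one full 256-token canvas at the quoted rate.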

Nuts.

Conclusion

Keen eyes might have noticed that a 256-token canvas won’t let you one-shot a full novel, but it’s still a fascinating architectural shift. I believe this approach will get even better in terms of speed and quality with larger canvas sizes. I doubt it will have much of an impact on problem-solving tasks. Just my opinion though. :^)

Oh, and yeah. If this takes off, say goodbye to streaming modes and UIs for LLMs.

Notes

It was initially hard for me to wrap my head around how the model could select a better token. In hindsight, it’s obvious. It’s embeddings all the way down.

Next up, I plan on writing a post breaking down Multi-Token Prediction (MTP).

P.S. Google is not the only company working on this; there’s also inceptionlabs.ai. Karpathy mentioned them in a tweet back in February 2025, ages ago. :^)

¹ End-of-sequence token.
² Padding token.
³ Decoder mode, bidirectional. Predicts next logits.
⁴ Multi-Token Prediction mode.